
Applications of Saliency Models – Part Three

Catch up on Parts One and Two.
Applications based on abnormality processing

The third category of attention-based applications concerns abnormality processing. Some applications go further than simply detecting the areas of interest: they compare the salient areas found on the saliency maps. Application domains such as robotics or advertising benefit greatly from this category of applications.

Robotics is a very large application domain with various needs. There are three research axes where robots can take advantage of saliency models: 1) image registration and landmark extraction, 2) object recognition, and 3) robot action guidance.

An important need of a robot is to know where it is located. To this end, the robot can use the data from its sensors to find landmarks (salient feature extraction) and register images taken at different times (salient feature comparison) to build a model of the scene. The general process of building a view of the scene in real time is called Simultaneous Localization and Mapping (SLAM). Saliency models can help a lot in extracting more stable landmarks from images, which can then be compared more robustly [25]. These techniques first require the computation of saliency maps, but the results are not used directly: they are further processed (in particular, the salient areas are compared).

Another important need of robots, after they have established the scene, is to recognize the objects which are present in this scene and which might be interesting to interact with. Two steps are needed to recognize objects. First of all, the robot needs to detect the object in the scene. For this goal saliency models can help a lot, as they can provide information about proto-objects [26] or the objectness of areas [27]. Once objects are detected, they need to be recognized. Here the main approach is to 1) extract features (SIFT, SURF or any others) from the object, 2) filter the features based on a saliency map, and 3) perform the recognition with a classifier (such as an SVM). Papers like [28] or [29] apply this technique, which lets a computer drastically decrease the number of keypoints needed to perform the object recognition. Another approach is used in [30] or [31]: the features which are mostly present in the searched object and absent from its surroundings are learned, and this learning phase provides a new set of weights for bottom-up attention models. In this way, the features which are the most discriminant for the searched object get the highest response in the final saliency map. A third approach can be found in [32], where the relative positions of salient points (grouped into cliques) are used for image recognition.
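To make the keypoint-filtering step concrete, here is a minimal Python sketch of step 2 of the pipeline above. It assumes OpenCV with the contrib modules (for cv2.saliency and SIFT); the quantile threshold keep_ratio is an illustrative parameter, not a value taken from [28] or [29]:

```python
import cv2
import numpy as np

def salient_keypoints(image_bgr, keep_ratio=0.8):
    """Keep only the SIFT keypoints that fall on sufficiently salient pixels."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:
        return [], np.empty((0, 128), np.float32)

    # Any saliency model can be plugged in here; OpenCV's spectral-residual
    # implementation (from the contrib modules) is used as a stand-in.
    model = cv2.saliency.StaticSaliencySpectralResidual_create()
    _, sal = model.computeSaliency(image_bgr)
    sal = cv2.normalize(sal, None, 0.0, 1.0, cv2.NORM_MINMAX)

    # Discard keypoints lying in the flattest (least salient) areas.
    thresh = np.quantile(sal, keep_ratio)
    kept = [(kp, d) for kp, d in zip(keypoints, descriptors)
            if sal[int(kp.pt[1]), int(kp.pt[0])] >= thresh]
    if not kept:
        return [], np.empty((0, 128), np.float32)
    kps, descs = zip(*kept)
    return list(kps), np.vstack(descs)
```

Matching and classification (step 3) then run on the reduced descriptor set, which is where the computational savings reported in [28][29] come from.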

Once robots know where they are (attentive visual SLAM) and recognize the objects around them (attentive object recognition), they need to decide what to do next. One of the decisions they need to make is where to look next, and this decision is obviously taken based on visual attention. Several robots, like the iCub, implement multi-modal attention: they combine visual and audio saliency in an egosphere, which is used to point the gaze at the next location. An interesting survey on attention for interactive robots can be found in [33].

Another domain also belongs to this abnormal-region-processing category of applications: visual communication optimization. Marketing optimization can be applied to a large number of practical cases such as websites, advertisements, product placement in supermarkets, signage, and 2D and 3D object placement in galleries.

Among the different applications of automatic saliency computation, marketing and communication optimization is probably one of the closest to market. As it is possible to predict an image attention map, that is, a map of the probability that people attend each pixel of the image, it is possible to predict where people are likely to look on marketing material like an advertisement or a website. Attracting customer attention is the first step in the process of arousing people's interest, inducing desire and need for the product, and finally pushing the client to buy it.

Feng-GUI [34] is an Israeli company mainly focusing on web page and advertising optimization, even if its algorithm is also capable of analyzing video sequences. AttentionWizard [35] is a US company mainly focusing on web pages. There are only a few hints about the underlying algorithm: it uses bottom-up features such as color differences, contrast, density, brightness and intensity, edges and intersections, length and width, and curve and line orientations. Top-down features include face detection, skin color and text detection (especially big text). 3M VAS [36] is the only big international player in this field. Very few details are given on the algorithm, but it is also capable of providing video saliency. They provide attention maps for web page optimization, but also for advertisements with static images or videos, packaging or in-store merchandising. EyeQuant [37] is a German company specialized in website optimization. Its algorithm is trained on extensive eye-tracking tests to bring it closer to real eye-tracking for a given task. All those companies claim around 90% accuracy for the first 3 to 5 viewing seconds [38]. They base this claim on comparisons between their algorithms and several existing databases using several ROC metrics, always comparing the results with the maximum ROC score obtained by human observers. Nevertheless, for real-life images and for given tasks and emotion-based communication, this accuracy drops dramatically, although it remains usable.

With more and more 3D objects being created, manipulated, sold or even printed, 3D saliency is a very promising future research direction. The main idea is to compute a saliency score for each view of a 3D model: the best viewpoint is the one where the total object saliency is maximized [39]. Mesh saliency was introduced by adapting 2D saliency concepts to the mesh structure [40]. Viewpoint selection and mesh simplification are also related through the use of mesh saliency [41]. While the best-viewpoint application can be used for computer graphics or even 3D mesh compression, marketing is one of the targets of this research topic: more and more 3D objects are shown on the internet, and the question of how to display them optimally is very interesting for marketing.
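The best-viewpoint idea of [39] reduces to a simple selection loop. The sketch below only illustrates that logic; render_view and saliency_2d are hypothetical placeholders for an off-screen renderer and a 2D saliency model, which the cited works implement in their own ways:

```python
import numpy as np

def best_viewpoint(mesh, candidate_views, render_view, saliency_2d):
    """Pick the viewpoint whose rendered view carries the most total saliency.

    `render_view(mesh, view)` and `saliency_2d(image)` are placeholders for
    whatever renderer and 2D saliency model are available; only the
    selection logic is sketched here.
    """
    best_view, best_score = None, -np.inf
    for view in candidate_views:
        image = render_view(mesh, view)      # off-screen render of the mesh
        score = saliency_2d(image).sum()     # total saliency visible from here
        if score > best_score:
            best_view, best_score = view, score
    return best_view
```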

Conclusion

During the last two decades, significant progress has been made in the area of visual attention.
Regarding the applications, a three-category taxonomy is proposed here:

  • Abnormality detection: uses the detection of the most salient areas.
  • Normality detection: uses the detection of the least salient areas.
  • Abnormality processing: compares and further processes the most salient areas.

These categories let us simplify and classify a very long list of applications which can benefit from attention models. We are just at the early stages of the use of saliency maps in computer vision applications. Nevertheless, the number of already existing applications shows a promising avenue for saliency models, both in improving existing applications and in creating new ones. Indeed, several factors are nowadays moving saliency computation from the lab to industry:

  • The models' accuracy has drastically increased over two decades, concerning both bottom-up saliency and top-down information and learning.
  • The models working on both videos and images are more and more numerous and provide more and more realistic results. New models including audio signals and 3D data are being released and are expected to provide convincing results in the near future.
  • The combined improvement of computing hardware and algorithm optimization has led to good-quality saliency computation in real time or near real time.

References:
25. Frintrop, S. and Jensfelt, P. (2008) Attentional landmarks and active gaze control for visual slam. Robotics, IEEE Transactions on, 24 (5), 1054–1065.
26. Walther, D. and Koch, C. (2006) Modeling attention to salient proto-objects. Neural networks, 19 (9), 1395–1407.
27. Alexe, B., Deselaers, T., and Ferrari, V. (2010) What is an object?, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, pp. 73–80.
28. Zdziarski, Z. and Dahyot, R. (2012) Feature selection using visual saliency for content-based image retrieval, in Signals and Systems Conference (ISSC 2012), IET Irish, IET, pp. 1–6.
29. Awad, D., Courboulay, V., and Revel, A. (2012) Saliency filtering of sift detectors: application to cbir, in Advanced Concepts for Intelligent Vision Systems, Springer, pp. 290–300.
30. Navalpakkam, V. and Itti, L. (2006) An integrated model of top-down and bottom-up attention for optimizing detection speed, in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, IEEE, pp. 2049–2056.
31. Frintrop, S., Backer, G., and Rome, E. (2005) Goal-directed search with a top-down modulated computational attention system, in Pattern Recognition, Springer, pp. 117–124.
32. Stentiford, F. and Bamidele, A. (2010) Image recognition using maximal cliques of interest points, in Image Processing (ICIP), 2010 17th IEEE International Conference on, IEEE, pp. 1121–1124.
33. Ferreira, J.F. and Dias, J. (2014) Attentional mechanisms for socially interactive robots–a survey. Autonomous Mental Development, IEEE Transactions on, 6 (2), 110–125.
34. Feng-GUI website proposes automatic saliency maps for marketing material. URL http://www.feng-gui.com/.
35. AttentionWizard website proposes automatic saliency maps for marketing material. URL https://www.attentionwizard.com/.
36. 3M VAS website proposes automatic saliency maps for marketing material. URL http://solutions.3m.com/wps/portal/3M/en_US/VAS-NA/VAS/.
37. EyeQuant website proposes automatic saliency maps for marketing material. URL http://www.eyequant.com/.
38. Page containing the 3M VAS studies showing algorithm accuracy in general and in a marketing framework. URL http://solutions.3m.com/wps/portal/3M/en_US/VAS-NA/VAS/eye-tracking-software/eye-tracking-studies/.
39. Takahashi, S., Fujishiro, I., Takeshima, Y., and Nishita, T. (2005) A feature-driven approach to locating optimal viewpoints for volume visualization, in Visualization, 2005. VIS 05. IEEE, IEEE, pp. 495–502.
40. Lee, C.H., Varshney, A., and Jacobs, D.W. (2005) Mesh saliency, in ACM Transactions on Graphics (TOG), vol. 24, ACM, pp. 659–666.
41. Castelló, P., Chover, M., Sbert, M., and Feixas, M. (2014) Reducing complexity in polygonal meshes with view-based saliency. Computer Aided Geometric Design, 31 (6), 279–293.


Applications of Saliency Models – Part Two

Missed Part One? We’ve got you covered.

Applications based on normality detection

In this section we focus on a second category of applications, based on the locations having the lowest saliency scores. Those areas correspond to repetitive and less informative regions, which can be compressed more easily.

Compression is the process of converting a signal into a format that takes up less storage space or transmission bandwidth. Classical compression methods tend to distribute the coding resources evenly across an image. In contrast, attention-based methods encode visually salient regions with high priority, while treating less interesting regions with low priority. The aim of these methods is to achieve compression without significant degradation of perceived quality.

In [1], a saliency map is computed for each frame of a video sequence and a smoothing filter is applied to all non-salient regions. Smoothing leads to higher spatial correlation and better prediction efficiency for the encoder, and therefore to a reduced bitrate for the encoded video. An extension of [1] uses a similar neurobiological model of visual attention to generate a saliency map [2]. The most salient locations are used to generate a so-called guidance map which guides the bit allocation. Using the bit allocation model of [2], a scheme for attentive video compression has been suggested by [3]. This method is based on visual saliency propagation (using motion vectors) to save computational time. More recently, attention-based image compression patents like [4] have been granted, which also shows that compression algorithms are becoming efficient enough for real-life applications and are close to reaching the market.
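The pre-encoding smoothing idea of [1] can be sketched in a few lines of Python: blend each frame with a blurred copy of itself, using the saliency map as the per-pixel mixing weight. This is a simplified illustration, not the exact foveation filter of [1]:

```python
import cv2
import numpy as np

def presmooth_frame(frame_bgr, saliency, sigma=7):
    """Blend each pixel between the original frame and a blurred copy.

    `saliency` is assumed to be a float map in [0, 1] from any saliency
    model. Non-salient regions come out smoother, which raises spatial
    correlation and lets the encoder spend fewer bits there.
    """
    blurred = cv2.GaussianBlur(frame_bgr, (0, 0), sigma)  # kernel size from sigma
    alpha = saliency.astype(np.float32)[..., np.newaxis]  # per-pixel mixing weight
    mixed = alpha * frame_bgr.astype(np.float32) + (1.0 - alpha) * blurred
    return mixed.astype(np.uint8)
```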

Compression aims at reducing the amount of data in a signal. A usual approach consists of modifying the coding rate, but other approaches can also reduce the amount of data by cropping or resizing the signal. An obvious idea which drastically compresses an image is of course to decrease its size. This size reduction can be abrupt (zoom on a region while the rest of the image is discarded) or softer (the resolution of the context around the region of interest is decreased but not fully discarded).

The authors in [5] use Itti's algorithm to compute the saliency map [6], which serves as a basis to automatically delineate a rectangular cropping window. The self-adaptive image cropping for small displays of [7] is based on the Itti and Koch bottom-up attention algorithm, but also on top-down considerations such as face detection and skin color. According to a given threshold, each region is either kept or eliminated. A completely automatic solution to create thumbnails according to the saliency distribution or the cover rate is presented in [8]. The algorithm proposed in [9] starts by adaptively partitioning the input image into a number of strips according to a combined map which contains both gradient information and visual saliency. Methods of intelligent perceptual zooming based on saliency algorithms become more and more interesting with the advances in saliency map computation, in terms of both real-time performance and spatio-temporal cue integration. Even big companies such as Google [10] are becoming more involved in developing applications based on perceptual zooms. The idea is to generalize the perceptual zoom to images and videos and to keep the temporal coherence of the zoomed image, even when objects of interest suddenly appear far from the previous zoom area.
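As an illustration of saliency-driven cropping in the spirit of [5] and [8], the following sketch selects the smallest bounding box containing a given fraction of the total saliency mass; the keep fraction plays the role of the cover rate mentioned above:

```python
import numpy as np

def saliency_crop_window(saliency, keep=0.9):
    """Smallest axis-aligned box containing a fraction of total saliency.

    A simple greedy variant: threshold the map so that the retained pixels
    carry `keep` of the saliency mass, then take their bounding box.
    """
    flat = np.sort(saliency.ravel())[::-1]        # saliency values, descending
    cum = np.cumsum(flat)
    thresh = flat[np.searchsorted(cum, keep * cum[-1])]
    ys, xs = np.nonzero(saliency >= thresh)
    return xs.min(), ys.min(), xs.max(), ys.max()  # x0, y0, x1, y1
```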

Perceptual zoom does not always preserve the image structure. To keep the image structure intact, several methods exist: warping and seam carving. Those methods are also used to provide data “summarization”.

Warping is an operation that maps a position in a source image to a position in a target image by a spatial transformation. This transformation could be a simple scaling transformation [11]. A retargeting method based on global energy optimization is detailed in [12] and extended to combine uniform sampling and a structure-aware image representation [13]. A warping method which uses a grid mesh of quads to retarget images is defined in [14]. The method determines an optimal scaling factor for regions with high content importance as well as for regions with homogeneous content, which will be distorted. A significance map is computed as the product of the gradient and the saliency map. [15] proposes an extended significance measure to preserve the shapes of both visually salient objects and structure lines while minimizing visual distortion.

The other method for image retargeting is seam carving. Seam carving [16] retargets the image using an energy function which defines the importance of pixels. The most classical energy function is the gradient map, but other functions can be used, such as entropy, histograms of oriented gradients, or saliency maps [17]. For spatio-temporal data, [18] proposes to remove 2D seam manifolds from 3D space-time volumes, replacing the dynamic programming method with graph-cut optimization to find the optimal seams. A saliency-based spatio-temporal seam-carving approach with much better spatio-temporal continuity than [18] is proposed in [19]. In [20], the authors describe a saliency map which takes more context into account and propose to apply it to seam carving. Interestingly, recent papers such as [21] propose to mix seam-carving and warping techniques.
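The following sketch shows the core of saliency-augmented seam carving: an energy map mixing gradient magnitude and saliency, a dynamic-programming search for the cheapest vertical seam, and its removal. It follows the classical formulation of [16] with a saliency term as surveyed in [17]; the weight lam is illustrative:

```python
import numpy as np

def saliency_energy(gray, saliency, lam=1.0):
    """Energy map: gradient magnitude plus a weighted saliency term.

    `saliency` is assumed normalized to [0, 1]; `lam` balances the two terms.
    """
    gy, gx = np.gradient(gray.astype(np.float32))
    return np.hypot(gx, gy) + lam * saliency

def find_vertical_seam(energy):
    """Dynamic-programming search for the minimum-energy vertical seam.

    Naive Python loops are used for clarity, not speed.
    """
    h, w = energy.shape
    cost = energy.copy()
    back = np.zeros((h, w), dtype=np.int64)
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)
            k = lo + int(np.argmin(cost[i - 1, lo:hi]))  # cheapest parent above
            back[i, j] = k
            cost[i, j] = energy[i, j] + cost[i - 1, k]
    seam = np.empty(h, dtype=np.int64)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(h - 2, -1, -1):                       # backtrack the seam
        seam[i] = back[i + 1, seam[i + 1]]
    return seam

def remove_vertical_seam(img, seam):
    """Drop one pixel per row along the seam (works for gray or color images)."""
    h, w = img.shape[:2]
    mask = np.ones((h, w), dtype=bool)
    mask[np.arange(h), seam] = False
    return img[mask].reshape(h, w - 1, *img.shape[2:])
```

Repeatedly calling find_vertical_seam and remove_vertical_seam narrows the image column by column; the saliency term steers seams away from regions people are predicted to look at.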

Summarization of images or videos is a term similar to retargeting. It might be based on cropping [22], or on carving as in [23]. The main purpose is to provide a relevant summary of a video or an image. In [24], the authors use video summarization to provide a mashup of several videos into a single pleasant video containing the important sequences of all the concatenated videos.

References:

1. Itti, L. (2004) Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing, 13 (10), 1304–1318, doi:10.1109/TIP.2004.834657.

2. Li, Z., Qin, S., and Itti, L. (2011) Visual attention guided bit allocation in video compression. Image and Vision Computing, 29 (1), 1–14, doi:10.1016/j.imavis.2010.07.001. URL http://www.sciencedirect.com/science/article/pii/S0262885610001083.

3. Gupta, R. and Chaudhury, S. (2011) A scheme for attentional video compression. Pattern Recognition and Machine Intelligence, 6744, 458–465.

4. Zund, F., Pritch, Y., Hornung, A.S., and Gross, T. (2013), Content-aware image compression method. US Patent App. 13/802,165.

5. Suh, B., Ling, H., Bederson, B.B., and Jacobs, D.W. (2003) Automatic thumbnail cropping and its effectiveness, in Proceedings of the 16th annual ACM symposium on User interface software and technology (UIST), pp. 95–104.

6. Itti, L. and Koch, C. (2001) Computational modelling of visual attention. Nature Reviews Neuroscience, 2 (3), 194–203.

7. Ciocca, G., Cusano, C., Gasparini, F., and Schettini, R. (2007) Self-adaptive image cropping for small displays. IEEE Transactions on Consumer Electronics, 53 (4), 1622–1627.

8. Le Meur, O., Le Callet, P., and Barba, D. (2007) Construction d’images miniatures avec recadrage automatique basé sur un modèle perceptuel bio-inspiré, in Traitement du signal, vol. 24 (5), pp. 323–335.

9. Zhu, T., Wang, W., Liu, P., and Xie, Y. (2011) Saliency-based adaptive scaling for image retargeting, in Computational Intelligence and Security (CIS), 2011 Seventh International Conference on, pp. 1201–1205, doi:10.1109/CIS.2011.266.

10. Grundmann, M. and Kwatra, V. (2014), Methods and systems for video retargeting using motion saliency. US Patent App. 14/058,411. URL http://www.google.com/patents/US20140044404.

11. Liu, F. and Gleicher, M. (2005) Automatic image retargeting with fisheye-view warping, in Proceedings of User Interface Software Technologies (UIST). URL http://graphics.cs.wisc.edu/Papers/2005/LG05.

12. Ren, T., Liu, Y., and Wu, G. (2009) Image retargeting using multi-map constrained region warping, in ACM Multimedia, pp. 853–856.

13. Ren, T., Liu, Y., and Wu, G. (2010) Rapid image retargeting based on curve-edge grid representation, in ICIP, pp. 869–872.

14. Wang, Y.S., Tai, C.L., Sorkine, O., and Lee, T.Y. (2008) Optimized scale-and-stretch for image resizing. ACM Trans. Graph. (Proceedings of ACM SIGGRAPH ASIA), 27 (5).

15. Lin, S.S., Yeh, I.C., Lin, C.H., and Lee, T.Y. (2013) Patch-based image warping for content-aware retargeting. Multimedia, IEEE Transactions on, 15 (2), 359–368, doi:10.1109/TMM.2012.2228475.

16. Avidan, S. and Shamir, A. (2007) Seam carving for content-aware image resizing. ACM Trans. Graph., 26 (3), 10.

17. Vaquero, D., Turk, M., Pulli, K., Tico, M., and Gelfand, N. (2010) A survey of image retargeting techniques, in SPIE Applications of Digital Image Processing.

18. Rubinstein, M., Shamir, A., and Avidan, S. (2008) Improved seam carving for video retargeting. ACM Transactions on Graphics (SIGGRAPH), 27 (3), 1–9.

19. Grundmann, M., Kwatra, V., Han, M., and Essa, I. (2010) Discontinuous seam-carving for video retargeting, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 569–576, doi:10.1109/CVPR.2010.5540165.

20. Goferman, S., Zelnik-Manor, L., and Tal, A. (2012) Context-aware saliency detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34 (10), 1915–1926.

21. Wu, L., Cao, L., Xu, M., and Wang, J. (2014) A hybrid image retargeting approach via combining seam carving and grid warping. Journal of Multimedia, 9 (4). URL http://ojs.academypublisher.com/index.php/jmm/article/view/jmm0904483492.

22. Ejaz, N., Mehmood, I., Sajjad, M., and Baik, S.W. (2014) Video summarization by employing visual saliency in a sufficient content change method. International Journal of Computer Theory and Engineering, 6 (1), 26.

23. Dong, W., Zhou, N., Lee, T.Y., Wu, F., Kong, Y., and Zhang, X. (2014) Summarization-based image resizing by intelligent object carving. Visualization and Computer Graphics, IEEE Transactions on, 20 (1), 1–1.

24. Zhang, L., Xia, Y., Mao, K., Ma, H., and Shan, Z. (2015) An effective video summarization framework toward handheld devices. Industrial Electronics, IEEE Transactions on, 62 (2), 1309–1316.


Applications of Saliency Models – Part One

Attention modeling: a huge range of applications.

The applications of saliency maps are numerous and occur in many domains. For some applications, the saliency maps and their analyses are the final goal, while for others saliency maps are only an intermediary step. We propose a classification into three categories of applications.

The first category of applications directly takes advantage of the detection of surprising, thus abnormal, areas in the signal. We can call this class of applications “Abnormality detection”. Surveillance or event/defect detection are examples of application domains in this category.

The second category focuses on the opposite of the first one: as the attention maps provide us with an idea about the surprising parts of the signal, one can deduce where the normal (homogeneous, repetitive, usual, etc.) signal is. We will call this category “Normality detection”. The main application domains are signal compression and retargeting.

Finally, the third application category is related to the surprising parts of the signal but goes further than simple detection. This application family will be called “Abnormality processing”, and it needs to compare and further process the most salient regions. Domains such as robotics, object retrieval or interface optimization can be found in this category.

Applications based on abnormality detection

In this section, applications are related to surveillance or defect detection. Some authors took into account the concept of “usual motion”, either by accumulating motion features from videos in given regions, which provides a “normality” of the motion in those regions [3], or by using more complex systems such as Hidden Markov Models (HMMs) to predict future normal motion [4].

While abnormal motion has mostly been used for crowd scenes, some authors, as in [9], provide models which work on any general scene containing motion. Some saliency models were also used [10][11] with audio data to spot unusual sounds within classical contextual sounds, like a gunshot in the middle of a metro station ambiance.

In [13], saliency models are used for defect detection and were first applied to automatic fruit grading. In [9], in addition to video surveillance, the model can also be applied to static images to find generic defects. Saliency models are applied to defect detection in a wide variety of domains, such as semiconductor manufacturing and electronics production [14], metallic surfaces [15] or wafer defects [16].

To this category, we could add the use of saliency in computer graphics [37] or quality metrics [49], where the abnormal regions of the image are used to optimize the graphical representation or to give different weights to the quality metric depending on the pixels. In the next parts, we will see the two other categories of applications of saliency models in engineering: normality detection and abnormality processing.

References:
3. Mancas, M. and Gosselin, B. (2010) Dense crowd analysis through bottom-up and top-down attention. Proc. of the Brain Inspired Cognitive Systems (BICS).
4. Jouneau, E. and Carincotte, C. (2011) Particle-based tracking model for automatic anomaly detection, in Image Processing (ICIP), 2011 18th IEEE International Conference on, IEEE, pp. 513–516.
9. Boiman, O. and Irani, M. (2007) Detecting irregularities in images and in video. International Journal of Computer Vision, 74 (1), 17–31.
10. Couvreur, L., Bettens, F., Hancq, J., and Mancas, M. (2007) Normalized auditory attention levels for automatic audio surveillance, in Int. Conf. on Safety and Security Engineering.
11. Mancas, M., Couvreur, L., Gosselin, B., Macq, B. et al. (2007) Computational attention for event detection, in Proc. Fifth International Conf. Computer Vision Systems.
13. Mancas, M., Unay, D., Gosselin, B., and Macq, B. (2007) Computational attention for defect localisation, in Proceedings of ICVS Workshop on Computational Attention & Applications.
14. Bai, X., Fang, Y., Lin, W., Wang, L., and Ju, B.F. (2014) Saliency-based defect detection in industrial images by using phase spectrum. Industrial Informatics, IEEE Transactions on, 10 (4), 2135–2145.
15. Bonnin-Pascual, F. and Ortiz, A. (2014) A probabilistic approach for defect detection based on saliency mechanisms, in Emerging Technology and Factory Automation (ETFA), 2014 IEEE, IEEE, pp. 1–4.
16. Mishne, G. and Cohen, I. (2014) Multi-channel wafer defect detection using diffusion maps, in Electrical & Electronics Engineers in Israel (IEEEI), 2014 IEEE 28th Convention of, IEEE, pp. 1–5.
37. Longhurst, P., Debattista, K., and Chalmers, A. (2006) A gpu based saliency map for high-fidelity selective rendering, in Proceedings of the 4th international conference on Computer graphics, virtual reality, visualisation and interaction in Africa, ACM, pp. 21–29.
49. Ninassi, A., Le Meur, O., Le Callet, P., and Barba, D. (2007) Does where you gaze on an image affect your perception of quality? Applying visual attention to image quality metric, in Image Processing, 2007. ICIP 2007. IEEE International Conference on, vol. 2, pp. II-169–II-172, doi:10.1109/ICIP.2007.4379119.


Attention in computer science – Part 2

In the previous part we mainly dealt with visibility models and static saliency models of attention. But the notion of computational attention could not remain focused only on static images, and it has been extended to other modalities.

  1. Video saliency

Some still-image models were simply extended to video. For example, Seo & Milanfar (2009) introduced the time dimension by replacing square spatial patches with 3D spatio-temporal cubes where the third dimension is time. Itti's model was also generalized with the addition of motion and flicker features to the initial spatial set of features containing luminance, color and orientations. Those models mainly show that strong motion is salient. A question might be: what can saliency models bring beyond a good motion detector? Models like Mancas et al. (2011) have developed a bottom-up saliency map to detect abnormal motion. The model exhibits promising results from a few moving objects to dense crowds, with performance increasing with crowd density. The idea is to show that motion is salient most of the time, but that within motion some moving areas are more interesting than others.

  2. 3D saliency

3D saliency modeling is an emerging area of research which was boosted by two facts. First, the arrival of affordable RGB-D cameras, which provide both classical RGB images and a depth map describing each pixel's distance from the camera. This depth information is very important and provides new features (curvature, compactness, convexity, …). The second is the wider availability of 3D models (used for example in 3D printing). 3D models are more easily available, and libraries like PCL (Aldoma et al. (2012)) can handle 3D point clouds, convert formats and compute features from those point clouds. As for video, most 3D saliency models are extensions of still-image models. Some use the 3D meshes based on Itti's approach, others just add the depth as an additional feature, while recent models are based on the use of point clouds. As 3D saliency models are mainly extensions of 2D models, depending on the extended model, the different features can be taken into account locally and/or globally on the 3D objects.

  3. Audio saliency

There are very few auditory attention models compared to visual attention models. One approach deals with the local context of audio signals. Kayser et al. (2005) compute auditory saliency maps based on Itti's visual model (1998). First, the sound wave is converted to a time-frequency representation. Then three auditory features are extracted at different scales and in parallel (intensity, frequency contrast, and temporal contrast). For each feature, the maps obtained at different scales are compared using a center-surround mechanism and normalized. Finally, a linear combination builds the saliency map, which is then reduced to one dimension to fit the one-dimensional audio signal. Another approach to computing auditory saliency maps follows the well-established Bayesian surprise approach from computer vision (Itti & Baldi (2006)). An auditory surprise is introduced to detect acoustically salient events. First, a Short-Time Fourier Transform (STFT) is used to calculate the spectrogram. The surprise is then computed in the Bayesian framework.
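As a toy illustration of the surprise idea, the sketch below computes an STFT spectrogram and scores each frame by its deviation from an exponentially decaying average of past frames. This is only a crude proxy: true Bayesian surprise (Itti & Baldi (2006)) compares prior and posterior distributions rather than using a running-mean distance:

```python
import numpy as np
from scipy.signal import stft

def audio_surprise(signal, fs, alpha=0.95):
    """Simplified per-frame 'surprise' score on a log-magnitude spectrogram.

    Each new spectral frame is compared with a running average of past
    frames (the 'expected' spectrum); large deviations mark salient events.
    """
    _, _, Z = stft(signal, fs=fs, nperseg=512)
    mag = np.log1p(np.abs(Z))                 # log-magnitude, freq x time
    expected = mag[:, 0].copy()
    surprise = np.zeros(mag.shape[1])
    for t in range(1, mag.shape[1]):
        surprise[t] = np.sum((mag[:, t] - expected) ** 2)
        expected = alpha * expected + (1 - alpha) * mag[:, t]
    return surprise
```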

  4. Top-down saliency

Top-down information is endogenous and comes from the inner world (information from memory, its related emotional level, and also task-related information). In practice, two main families of top-down information can be added to bottom-up attention models.

4.1 What is normal?

The first one mainly deals with learned normality, which can come from experience of the current signal if it is time-varying, or from previous experience (tests, databases). Concerning still images, the “normal” gaze behavior can be learned from the “mean observer”. Eye-tracking techniques (Judd et al. (2009)) or mouse tracking (Mancas (2007)) can be used on several users, and the average of their gaze on a set of natural images can be computed. The results show that, for natural images, the eye gaze is attracted by the center of the images. This observation for natural images is very different for more specific images which involve a priori knowledge. Mancas (2009) showed using mouse tracking that gaze density is very different on a set of advertisements and on a set of websites. This is partly due to the a priori knowledge that people have about those images (areas containing the title, logo, menu). For video signals, it is also possible to accumulate motion patterns over time for each extracted feature to get a model of normality. After a given period of observation, the model can detect that, at a given location, moving objects are generally fast and go from left to right. If an object at the same location is slow and/or goes from right to left, this is surprising given what was previously learned from the scene, so attention will be directed to this object. For 3D signals, another cue is the proximity of objects: a close object is more likely to attract attention, as it is more likely to be the first one we will have to interact with.
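The video normality idea can be illustrated with a per-pixel running model of motion magnitude: attention goes to pixels whose current motion deviates strongly from the statistics learned so far. This is a deliberately minimal sketch (frame differencing stands in for richer motion features such as direction or speed):

```python
import numpy as np

class MotionNormalityModel:
    """Per-pixel running model of 'usual' motion magnitude.

    Each pixel keeps an exponential running mean and variance of its motion
    feature (here simply the absolute frame difference); high squared
    z-scores flag motion that is surprising given the learned scene.
    """
    def __init__(self, shape, alpha=0.01):
        self.mean = np.zeros(shape, np.float32)
        self.var = np.ones(shape, np.float32)
        self.alpha = alpha
        self.prev = None

    def update(self, gray_frame):
        gray = gray_frame.astype(np.float32)
        if self.prev is None:
            self.prev = gray
            return np.zeros_like(gray)
        motion = np.abs(gray - self.prev)     # crude motion feature
        self.prev = gray
        z2 = (motion - self.mean) ** 2 / (self.var + 1e-6)  # surprise score
        # Update the normality statistics after scoring the current frame.
        self.mean += self.alpha * (motion - self.mean)
        self.var += self.alpha * ((motion - self.mean) ** 2 - self.var)
        return z2
```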

4.2 Where are my keys?

While the previous section dealt with attention attracted by events which are not consistent with the knowledge acquired about the scene, here we focus on a second main top-down cue: a visual task (“Find the keys!”). This task has a huge influence on the way the image is attended, and it implies object recognition (“recognize the keys”) and knowledge of objects' usual locations (“they could be on the floor, but never on the ceiling”).

Object recognition can be achieved through classical methods or using points of interest (like SIFT or SURF, Bay et al. (2008)). Some authors integrated the notion of object recognition into the architecture of their model, like Navalpakkam & Itti (2005). They extract the same features as for the bottom-up model from the object and learn them. This learning step provides weight modifications for the fusion of the conspicuity maps, which leads to the detection of the areas which contain the same feature combination as the learned object. Another approach to adding top-down information is to give a higher weight to the areas of the image which have a higher probability of containing the searched object. Several authors, such as Oliva et al. (2003), developed methods to learn objects' locations.

  5. Learning bottom-up and top-down together

Recently, learning the salient features has become more and more popular: the idea is not to find the rare regions, but to find an optimal description of those rare regions which are already known from eye-tracking or mouse-tracking ground truth. The learning is based on deep neural networks, sparse coding and pooling over large image datasets where the regions of interest are known. The most attended regions based on eye-tracking results are used to train classifiers which extract the main features of these areas. The use of deep neural networks greatly improved those techniques, which are now able to extract meaningful middle- and high-level features which best describe the salient regions (Shen & Zhao (2014)). Indeed, this learning step will find the classical bottom-up features in the first layers, but it will also add context, a centered Gaussian bias, and object detection (faces, text) and recognition together. An issue with those methods is a loss of generality: the models work for given datasets, even if deep learning is able to cope with high variability in the case of general images, for example.
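A toy supervised setup along these lines is sketched below in PyTorch: a small fully-convolutional network regresses the eye-tracking heatmap from the image. Real deep saliency models (e.g., Shen & Zhao (2014) and later work) are far deeper and trained on large fixation datasets; only the training-loop structure is meant to be representative:

```python
import torch
import torch.nn as nn

class TinySaliencyNet(nn.Module):
    """Toy fully-convolutional regressor from an image to a fixation map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),   # one-channel saliency map
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))

model = TinySaliencyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(images, fixation_maps):
    """images: (B,3,H,W) tensors; fixation_maps: (B,1,H,W) ground-truth heatmaps."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), fixation_maps)
    loss.backward()
    optimizer.step()
    return loss.item()
```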

  6. Attention in computer science

In computer science there are two families of models: some are based on feature visibility and others on the concept of saliency maps, the latter approach being by far the most prolific. For saliency-based bottom-up attention, the idea is the same for all the models: find areas of the image which are the most surprising in a given context (local, global or normality-based). Finally, a set of top-down cues which can influence the saliency-based models was reviewed. Recently, deep neural networks have been used to integrate both bottom-up and top-down information at the same time.

  7. References

Aldoma, A., Marton, Z.-C., Tombari, F., Wohlkinger, W., Potthast, C., Zeisl, B., Rusu, R. B., Gedikli, S. & Vincze, M. (2012). Point cloud library, IEEE Robotics & Automation Magazine.

Bay, H., Ess, A., Tuytelaars, T. & Gool, L. V. (2008). Surf: Speeded up robust features, Computer Vision and Image Understanding (CVIU) 110(3): 346–359.

Itti, L. & Baldi, P. F. (2006). Modeling what attracts human gaze over dynamic natural scenes, in L. Harris & M. Jenkin (eds), Computational Vision in Neural and Machine Systems, Cambridge University Press, Cambridge, MA.

Judd, T., Ehinger, K., Durand, F. & Torralba, A. (2009). Learning to predict where humans look, IEEE Inter. Conf. on Computer Vision (ICCV), pp. 2376–2383.

Kayser, C., Petkov, C., Lippert, M. & Logothetis, N. K. (2005). Mechanisms for allocating auditory attention: An auditory saliency map, Curr. Biol. 15: 1943–1947.

Mancas, M. (2007). Computational Attention Towards Attentive Computers, Presses universitaires de Louvain.

Mancas, M. (2009). Relative influence of bottom-up and top-down attention, Attention in Cognitive Systems, Vol. 5395 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg.

Mancas, M., Riche, N., Leroy, J. & Gosselin, B. (2011). Abnormal motion selection in crowds using bottom-up saliency, IEEE ICIP.

Navalpakkam, V. & Itti, L. (2005). Modeling the influence of task on attention, Vision Research 45(2): 205–231.

Oliva, A., Torralba, A., Castelhano, M. & Henderson, J. (2003). Top-down control of visual attention in object detection, Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, Vol. 1, pp. I-253–I-256.

Seo, H. J. & Milanfar, P. (2009). Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9(12). URL: http://www.journalofvision.org/content/9/12/15.abstract

Shen, C. & Zhao, Q. (2014). Learning to predict eye fixations for semantic contents using multi-layer sparse network, Neurocomputing 138: 61–68.


Attention in computer science – Part 1

Numediart Institute, Faculty of Engineering (FPMs), University of Mons (UMONS) Matei Mancas, 31 Bd. Dolez, 7000 Mons, Belgium

Idea and approaches. As we already saw, attention is a topic which was first taken into account by philosophy, then discussed by cognitive psychology and neuroscience, and only in the late nineties did attention modeling arrive in the domain of computer science and engineering. In this domain, two main approaches can be found. The first one is based on the notion of “saliency”, the second on the idea of “visibility”. In practice, models based on saliency are by far more widespread than visibility models in computer science. The notion of “saliency” implies a competition between “bottom-up” (exogenous) and “top-down” (endogenous) information. The idea of bottom-up saliency maps is that people's gaze is directed to areas which, in some way, stand out from the background based on novel or rare features. This bottom-up saliency can be modulated by top-down information based on memory, emotions or goals. The eye movements (scan paths) can be computed from the saliency map, which remains the same during eye motion: it is a global static attention (saliency) map which only provides, for each pixel, a probability of attracting human gaze.

Visibility models. These models of human attention assume that people attend locations that maximize the information acquired by the eye (the visibility) to solve a given task (which can also simply be free viewing). In this case, top-down information is naturally included in the notion of task, along with dynamic bottom-up information maximization. In this approach the eye movements are directly an output of the model and do not have to be inferred from a “saliency map”, which is considered as a surface giving the posterior probability (following each fixation) that the target is at each scene location (Geisler & Cormack (2011)). Compared to other Bayesian frameworks, like the one of Oliva et al. (2003), visibility models have one main difference: the saliency map is dynamic. Indeed, visibility models make explicit the resolution variability of the retina (Figure 1): an attention map is “re-computed” at each new fixation, as the feature visibility changes at each of these fixations. Tatler (2007) introduces a tendency of the eye gaze to stay in the middle of the scene to maximize visibility over the image (which recalls the center preference for natural images, also called the centered Gaussian bias).


Figure 1: Depending on the eye fixation position, visibility, and thus feature extraction, is different. Adapted from images by Jeff Perry.

Visibility models are mostly used in the case of strong tasks (like Legge et al. (2002), who proposed a visibility model capable of predicting eye fixations during the task of reading), and few of them are applied to free viewing, which is considered a weak task (Geisler & Cormack (2011)).

Saliency approaches: bottom-up methods. While visibility models are more used in cognitive science and with strong tasks, in computer science bottom-up approaches extract features from the signal only once, independently of the eye fixations, mainly for free viewing. Features are extracted from the image, such as luminance, color, orientation, texture, objects' relative positions or even simply neighborhoods or patches. Once those features are extracted, all the existing methods are essentially based on the same principle: looking for contrasted, rare, surprising, novel, worth-learning, less compressible or information-maximizing areas. All those definitions are actually synonyms, and they all amount to searching for unusual features in a given spatial context. In the following, we provide examples of contexts used for still images to obtain a saliency map. This saliency map can be visualized as a heatmap where hot colors represent pixels with a higher probability of attracting human gaze (Figure 2).


Figure 2: Left: initial image. Right: saliency heatmap superimposed on the initial image. The saliency map is static and gives an overview of where the eye is likely to attend.

Saliency methods for still images. The literature on still-image saliency models is very active. Those models have various implementations and technical approaches, even if initially they all derive from the same idea. The purpose here is not to review all those models; we instead propose a taxonomy to classify them. We structure this taxonomy of saliency methods on the context that those methods take into account to exhibit image novelty. In this framework, there are three classes of methods.

The first one focuses on a pixel's surroundings: here a pixel, a group of pixels or a patch is compared with its surroundings at one or several scales. The main idea is to compute visual features at several scales in parallel, to apply center-surround inhibition, to combine the results into conspicuity maps (one per feature) and finally to fuse them into a single saliency map. A lot of models derive from this approach, which mainly uses local center-surround contrast as a local measure of novelty. A good example of this family of approaches is Itti's model (Itti et al. (1998)), the first implementation of the Koch and Ullman model. This implementation proved to be the first successful approach to attention computation by providing better predictions of human gaze than chance or simple descriptors like entropy.
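A minimal single-feature version of this center-surround scheme can be sketched as follows; the pyramid levels used for center and surround are illustrative, and the full model of Itti et al. (1998) adds color and orientation features plus a normalization operator before fusion:

```python
import cv2
import numpy as np

def center_surround_saliency(gray, centers=(2, 3), deltas=(3, 4)):
    """Single-feature (intensity) center-surround conspicuity map.

    A Gaussian pyramid is built; fine 'center' levels are compared to
    coarser 'surround' levels and the absolute differences are summed.
    The image is assumed large enough to support 8 pyramid levels.
    """
    pyramid = [gray.astype(np.float32)]
    for _ in range(7):
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    h, w = gray.shape
    conspicuity = np.zeros((h, w), np.float32)
    for c in centers:
        for d in deltas:
            center = cv2.resize(pyramid[c], (w, h))        # finer scale
            surround = cv2.resize(pyramid[c + d], (w, h))  # coarser scale
            conspicuity += np.abs(center - surround)
    return conspicuity / (conspicuity.max() + 1e-9)
```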

A second class of methods uses the entire image as context and compares pixels or patches with pixels or patches at other locations in the image, not necessarily in the surroundings of the initial patch. The idea can be divided into two steps. First, local features are computed in parallel from a given image. The second step measures the likeness of a pixel or a neighborhood of pixels to other pixels or neighborhoods within the image. A good example can be found in Seo & Milanfar (2009), which first proposes to use local regression kernels as features, and then applies nonparametric kernel density estimation to those features, resulting in a saliency map of local “self-resemblance”. Mancas (2009) and Riche et al. (2013) focus on the entire image: these models are designed to detect saliency in areas which are globally rare and locally contrasted. Boiman & Irani (2007) look for similar patches and the relative positions of these patches in an image.

Finally, the third class of methods takes into account a context based on a model of what normality should be: if things are not as they should be, this can be surprising and thus attract people's attention. Achanta et al. (2009) proposed a very simple attention model: a distance is computed between a smoothed version of the input image and the average color vector of the input image. The average image is used as a kind of model of the image statistics: pixels which are far from those statistics are more salient. This model is mainly useful for salient object detection. Another approach to “normality” can be found in Hou & Zhang (2007), where the authors proposed a spectral model that is independent of any features. The difference between the log-spectrum of the image and its smoothed log-spectrum (the spectral residual) is reconstructed into a saliency map. Indeed, a smoothed version of the log-spectrum is closer to a 1/f-decreasing log-spectrum template of normality, as small variations are removed. This approach is almost as simple as Achanta et al. (2009) but more efficient at predicting eye fixations.
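Both of these "normality template" models are simple enough to sketch in a few lines each; the blur sizes below are illustrative choices, as the originals leave some freedom in such details:

```python
import cv2
import numpy as np

def frequency_tuned_saliency(bgr):
    """Achanta et al. (2009): distance to the mean Lab color, on a blurred image."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    blurred = cv2.GaussianBlur(lab, (5, 5), 0)
    mean_color = lab.reshape(-1, 3).mean(axis=0)      # image-statistics 'model'
    return np.linalg.norm(blurred - mean_color, axis=2)

def spectral_residual_saliency(gray):
    """Hou & Zhang (2007): log-spectrum minus its smoothed version."""
    f = np.fft.fft2(gray.astype(np.float32))
    log_amplitude = np.log(np.abs(f) + 1e-9)          # epsilon avoids log(0)
    phase = np.angle(f)
    residual = log_amplitude - cv2.blur(log_amplitude, (3, 3))
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return cv2.GaussianBlur(sal.astype(np.float32), (11, 11), 2.5)
```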

Towards video, audio or 3D signals and top-down attention. In the next parts we will focus on other kinds of signals such as moving images (video), audio or even 3D signals. In addition, even if top-down information is less often modeled in saliency approaches, there is an important literature on the topic which will also be detailed in the next parts.

References
Achanta, R., Hemami, S., Estrada, F. & Susstrunk, S. (2009). Frequency-tuned Salient Region Detection, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). URL: http://www.cvpr2009.org/
Boiman, O. & Irani, M. (2007). Detecting irregularities in images and in video, International Journal of Computer Vision 74(1): 17–31.
Geisler, W. S. & Cormack, L. (2011). Chapter 24: Models of Overt Attention, in The Oxford handbook of eye movements, Oxford University Press.
Hou, X. & Zhang, L. (2007). Saliency detection: A spectral residual approach, Proc. IEEE Conf. Computer Vision and Pattern Recognition CVPR ’07, pp. 1–8.
Itti, L., Koch, C. & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11): 1254 –1259.
Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an ideal-observer model of reading, Vision Research pp. 2219–2234.
Mancas, M. (2009). Relative influence of bottom-up and top-down attention, Attention in Cognitive Systems, Vol. 5395 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg.
Oliva, A., Torralba, A., Castelhano, M. & Henderson, J. (2003). Top-down control of visual attention in object detection, Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, Vol. 1, pp. I-253–I-256.
Riche, N., Mancas, M., Duvinage, M., Mibulumukini, M., Gosselin, B. & Dutoit, T. (2013). Rare2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis, Signal Processing: Image Communication 28(6): 642–658.
Seo, H. J. & Milanfar, P. (2009). Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9(12). URL: http://www.journalofvision.org/content/9/12/15.abstract
Tatler, B. (2007). The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions, Journal of Vision 7.


How to measure attention?

There are many ways to measure attention. Some, mainly in psychology, are more qualitative and use questionnaires and their interpretation. Others are quantitative but focus on the participants' feedback (button press, click, etc.) when they see/hear/sense a stimulus.

Here we focus on quantitative techniques which provide fine-grained information about attentive responses. The attentive response can be measured either directly in the brain, or indirectly through the participants' eye behavior. Only one of the techniques described here is based on active participant feedback: mouse tracking. This is because mouse-tracking feedback is very close to that of eye-tracking, and it is an emerging approach of interest for the future: it needs less time and less money, and provides more data than classical eye-tracking.

Eye-tracking: an indirect cue about covert attention

The use of an eye-tracker is probably the most widespread tool for attention measurement. The idea is to use a device able to precisely measure the eye gaze, which directly reflects overt attention only; covert attention, which occurs without eye movement, can at best be inferred indirectly.

Eye-tracking technology has evolved considerably over time; different technologies are described in [1]. One of the first techniques is EOG (Electro-OculoGraphy), where the skin's electric potential is measured around the eye, giving the eye direction relative to the head. This implies that, for a complete eye-tracking system, the head must either be fixed or a head-tracking system must be used in addition to the EOG. To get more precise results, special lenses can be used instead of EOG, but this technique is more invasive, and it still only provides the eye direction relative to the head, not the gaze as an intersection with a screen, for example.

The technique used by most current commercial and research solutions is based on video detection of the pupil/corneal reflection: an infra-red source sends light towards the eyes, the light is reflected by the eye, and the position of the reflection is used to compute the gaze direction.

While the technique is most of the time the same, the embodiment of the eye-tracker can be very different. The main eye-tracking manufacturers propose systems in different forms [2][3][4].

  1. Some eye-trackers are directly integrated into the screen which is used to present the data. This setup has the advantage of a very short calibration, but it can only be used with its own screen.
  2. Separate cameras need some additional calibration time, but the tests can be done on any screen and even in a real scene by using a scene camera.
  3. Eye-tracking glasses can be used in a very ecological setup, even outside on real-life scenes. An issue with those systems is that it is not easy to aggregate the data from several viewers, as the viewed scene is not the same for all of them; the aggregation needs a non-trivial registration of the scenes.
  4. Cheap devices are beginning to appear, and quite precise cameras are sold for less than 100 EUR [5], a fraction of the price of a professional eye-tracker. An issue with those eye-trackers is that they are sold with minimal software, and it is often difficult to synchronize the stimuli with the recorded data. They are mostly used as real-time human-machine interaction devices. Nevertheless, open-source projects such as Ogama [6] allow recording data from low-cost eye-trackers.
  5. Finally, webcam-based software is freely available [7]. It is able to provide good-quality data and can be used remotely with existing webcams [8].

Mouse-tracking: the low-cost eye-tracking

While eye tracking is the most reliable ground truth in the study of overt visual attention, it requires good practice from the operator, imposes constraints on the user (the head may need to be fixed, the calibration process may be long), and needs a complex system with a certain cost.

A much simpler way to acquire data about visual attention is mouse tracking. The mouse can be precisely followed in an Internet browser using a client-side language like JavaScript. The precise mouse position on the screen can be captured using home-made code or existing libraries like [9][10]. This technique may appear unreliable; however, everything depends on the context of the experiment.

  1. The first case is when the experiment is hidden from the participant: the participant is unaware that the mouse motion is being recorded. In this case the mouse motion is not accurate enough. Indeed, the hand does not automatically follow the eye gaze, even if a tendency of the hand (and consequently the mouse) to follow the gaze is visible. Sometimes the mouse is only used to scroll a page while the eyes are very far from the mouse pointer, for example.
  2. The second case is when the participant is aware of the experiment and has a task to follow. This can go from a simple “point the mouse where you look” instruction, as in [11] with the first use of mouse tracking for saliency evaluation, to more recent approaches such as SALICON [12], where multi-resolution interactive pointing mimicking the fovea's resolution pushes people to point the mouse cursor where they look.

In this second case, where the participant is aware that mouse motion is tracked, the results of mouse tracking are very close to eye-tracking, as shown by Egner and Scheier on their website [13]. Some unconscious eye movements may be missed, but is this really an issue?

The main advantages of mouse tracking are its low price and its complete transparency for the users (they just move a mouse pointer).

However, mouse tracking has several drawbacks:

  • The initial location of the mouse pointer is quite important, as the observer may look for the pointer. Should it be located outside the image or at the centre of the image? Ideally, the pointer should initially appear at a random position in the image to avoid introducing a bias from its initial position.
  • Mouse tracking only highlights areas which are consciously important for the observer. This is more a theoretical drawback, as in practice one usually tries to predict precisely those consciously interesting regions.
  • The pointer hides the image region it overlaps, so the pointer position is never exactly on the important areas but very close to them. This drawback can be partially eliminated by the low-pass filtering step performed after averaging over the whole observer set (see the sketch below). It is also possible to make the pointer transparent, as in [12].
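The aggregation pipeline common to eye- and mouse-tracking experiments (accumulate positions over all observers, then low-pass filter) can be sketched as follows; sigma is an illustrative smoothing scale:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_heatmap(points, height, width, sigma=25):
    """Aggregate tracked gaze/mouse positions into a continuous heatmap.

    `points` is an iterable of (x, y) pixel coordinates pooled over all
    observers. Accumulating the raw positions and Gaussian-filtering them
    yields the usual attention map; the smoothing also absorbs small
    offsets such as the pointer-position bias discussed above.
    """
    acc = np.zeros((height, width), np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            acc[yi, xi] += 1.0
    heat = gaussian_filter(acc, sigma)      # low-pass filtering step
    return heat / (heat.max() + 1e-9)
```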

EEG: Get the electric activity from the brain

The EEG technique (ElectroEncephaloGraphy) uses electrodes located on the participant's scalp. Those electrodes amplify the electrical waves coming from the brain. An issue with this technique is that the skull and scalp attenuate those electrical waves.

While classical research setups have a high number of electrodes, with manufacturers like [14][15], some low-cost commercial systems like Emotiv [16] are more compact and easy to install and calibrate. While the latter are easier to use, they are obviously less precise.

EEG studies have provided interesting results, such as the modulation of the gamma band [17] during selective visual attention. Other papers [18] also provide cues about alpha-band modifications during attentional shifts.

One very important cue about attention which can be measured using EEG is the P300 event-related potential.

The work of Näätänen et al. [19] in 1978 on auditory attention provided evidence that the evoked potential shows an increased negative response when the subject is presented with rare stimuli compared to frequent ones. This negative component is called the mismatch negativity (MMN), and it has been observed in several experiments. The MMN occurs 100 to 200 ms after the stimulus, a time which is perfectly in the range of the pre-attentive phase.

Depending on the experiments, different auditory features were isolated: audio frequency [20], audio intensity [19][21][22], spatial origin [23], duration [24] and phonetic changes [25]. None of these features was salient alone; saliency was induced by the rarity of each of them.

The study of the MMN signal for visual attention was conducted several times in conjunction with audio attention [26][27][28], but few experiments were made using only visual stimuli. In her thesis [29], Crottaz-Herbette conducted an experiment under the same conditions as for the auditory MMN, in order to find out whether a visual MMN really exists. The result was clearly positive, with a strong increase in the negativity of the evoked potential when seeing rare stimuli compared with frequent ones. The visual MMN occurs from 120 to 200 ms after the stimulus. The 200 ms frontier strikingly matches the 200 ms needed to initiate a first eye movement, and thus to initiate the “attentive” serial attentional mechanism. As for audio MMN detection, no specific task was asked of the subject, who only had to watch the stimuli; this MMN component is thus pre-attentive, unconscious and automatic.

This study and others [30] also suggest the presence of an MMN response for the somesthetic modality (touch, taste, etc.).

The MMN seems to be a universal component illustrating the brain's reaction to an unconscious pre-attentive process. Any unknown (novel, rare) stimulus will be very salient, as measured by the P300, because the brain will try to learn more about it. Rarity is the major engine of the attentional mechanism for visual, auditory and all the other signals acquired from our environment.

Functional imaging: fMRI

MRI stands for Magnetic Resonance Imaging. The main idea behind this kind of imaging is that the human body is mainly made of water, which is itself composed of hydrogen atoms whose nuclei are single protons. Those protons have a magnetic moment (spin) which is randomly oriented most of the time. The MRI device sets up a very strong magnetic field which aligns the magnetic moments of the protons in the patient's body. Radio-frequency (RF) pulses orthogonal to the initial magnetic field push the protons to align with this new orientation; they then realign with the initial magnetic field while releasing RF waves. Those waves are captured and used to construct an image in which bright gray levels indicate more protons, therefore more water (as in fat, for example), while darker gray levels reveal regions with less water (such as bones).

MRI is initially an anatomical imaging technique, but there is a functional version, fMRI, based on the BOLD (Blood-Oxygen-Level Dependent) contrast. Here no substance needs to be injected: hemoglobin itself acts as an endogenous contrast agent, since deoxygenated hemoglobin is paramagnetic while oxygenated hemoglobin is not. When a region of the brain is activated, local blood flow and blood oxygenation increase, which changes the local magnetic response. fMRI is thus capable of detecting the areas in the brain which are active, and has become a great tool for neuroscientists, who can visualize which area of the brain is activated during an attention-related exercise.
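As a toy illustration of how "activated" areas can be localized, the sketch below correlates every voxel's BOLD time course with a task on/off regressor; this is a drastic simplification of the general linear model used in real fMRI analyses:

```python
import numpy as np

def activation_map(bold, task):
    """bold: (n_voxels, n_timepoints) BOLD signals; task: (n_timepoints,)
    regressor, e.g. 1 during the attention exercise and 0 at rest.
    Returns one Pearson correlation per voxel; high values flag voxels
    whose activity follows the task."""
    b = bold - bold.mean(axis=1, keepdims=True)
    t = task - task.mean()
    num = b @ t
    denom = np.linalg.norm(b, axis=1) * np.linalg.norm(t) + 1e-12
    return num / denom
```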

Functional imaging: MEG

MEG stands for MagnetoEncephaloGraphy. The idea is simple: while EEG detects the electrical field, which is heavily distorted when traversing the skull and skin, MEG detects the magnetic field induced by the same electrical activity; this magnetic field has the advantage of not being distorted by the skin or the skull. While the idea is simple, in practice the brain's magnetic field is very weak, which makes it very difficult to measure. This is why MEG imaging is relatively new: technological advances, in particular SQUIDs (Superconducting Quantum Interference Devices), made it effective. The magnetic field of the brain induces currents in a superconducting device, which can be measured precisely. Modern devices have spatial resolutions of about 2 millimetres and temporal resolutions of a few milliseconds. Moreover, MEG images can be superimposed on anatomical MRI images, which helps to rapidly localize the main active areas. Finally, participants can sit upright during MEG recordings, a position more natural for exercises than the horizontal position required by fMRI or PET.

Functional imaging: PET Scan

As with fMRI, the PET scan (Positron Emission Tomography) is a functional imaging technique and thus also produces a higher signal in case of brain activity. The main idea of PET is that the substance injected into the patient releases positrons (anti-electrons: particles with the same mass as an electron but a positive charge). Those positrons almost instantaneously meet an electron in a highly exo-energetic reaction called annihilation. This annihilation transforms the whole mass of the two particles into energy, releasing two gamma photons in opposite directions which are detected by the scanner sensors. The injected substance accumulates in the areas of the brain which are the most active, which means that those areas exhibit a high number of annihilations. As with fMRI, the PET scan lets neuroscientists know which areas of the brain are activated while the patient performs an attention task.

Functional imaging and attention

Positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) have been extensively used to explore the functional neuroanatomy of cognitive functions, and MEG is beginning to be used in the field as well [31]. Reference [32] reviews 275 PET and fMRI studies covering attention, perception, memory, language and related functions. Depending on the setup and the task, a large variety of brain regions appear to be involved in attention and related functions. These findings again support the idea that, at the brain level, there are several attentions whose activity is largely distributed across almost the whole brain. Attention spans from low-level to high-level processing, from reflexes to memory and emotions, and across all the human senses.

References:

[1] Duchowski, Andrew. Eye tracking methodology: Theory and practice. Vol. 373. Springer Science & Business Media, 2007.

[2] Tobii eye tracking technology, http://www.tobii.com/

[3] SMI eye tracking technology, http://www.smivision.com

[4] SR-Research eye tracking technology, http://www.sr-research.com/

[5] Eyetribe low cost eye-trackers, https://theeyetribe.com/

[6] Open source recording from several eye trackers, http://www.ogama.net/

[7] Open source eye-tracking for webcams, http://sourceforge.net/projects/haytham/

[8] Xu, Pingmei, et al. “TurkerGaze: Crowdsourcing Saliency with Webcam based Eye Tracking.” arXiv preprint arXiv:1504.06755 (2015).

[9] Heatmapjs, javascript API, http://www.patrick-wied.at/static/heatmapjs/

[10] Simple Mouse Tracker, http://smt.speedzinemedia.com/

[11] Mancas, Matei. “Relative influence of bottom-up and top-down attention.” Attention in cognitive systems. Springer Berlin Heidelberg, 2009. 212-226.

[12] Jiang, Ming, et al. “SALICON: Saliency in Context.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[13] Mediaanalyzer web site: http://www.mediaanalyzer.net

[14] Cadwell EEG, http://www.cadwell.com/

[15] Natus EEG, http://www.natus.com

[16] Emotiv EEG, https://emotiv.com/

[17] Müller, Matthias M., Thomas Gruber, and Andreas Keil. “Modulation of induced gamma band activity in the human EEG by attention and visual information processing.” International Journal of Psychophysiology 38.3 (2000): 283-299.

[18] Sauseng, Paul, et al. “A shift of visual spatial attention is selectively associated with human EEG alpha activity.” European Journal of Neuroscience 22.11 (2005): 2917-2926.

[19] Näätänen, R., Gaillard, A.W.K., and Mäntysalo, S., “Early selective-attention effect on evoked potential reinterpreted”, Acta Psychologica, 42, 313-329, 1978

[20] Sams, M., Paavilainen, P., Alho, K., and Näätänen, R., “Auditory frequency discrimination and event-related potentials”, Electroencephalography and Clinical Neurophysiology, 62, 437-448, 1985

[21] Näätänen, R., and Picton, T., “The N1 wave of the human electric and magnetic response to sound: a review and analysis of the component structure”, Psychophysiology, 24, 375-425, 1987

[22] Paavilainen, P., Alho, K., Reinikainen, K., Sams, M., and Näätänen, R., “Right hemisphere dominance of different mismatch negativities”, Electroencephalography and Clinical Neurophysiology, 78, 466-479, 1991

[23] Paavilainen, P., Karlsson, M.L., Reinikainen, K., and Näätänen, R., “Mismatch Negativity to change in spatial location of an auditory stimulus”, Electroencephalography and Clinical Neurophysiology, 73, 129-141, 1989

[24] Paavilainen, P., Jiang, D., Lavikainen, J., and Näätänen, R., “Stimulus duration and the sensory memory trace: An event-related potential study”, Biological Psychology, 35 (2), 139-152, 1993

[25] Aaltonen, O., Niemi, P., Nyrke, T., and Tuhkahnen, J.M., “Event-related brain potentials and the perception of a phonetic continuum”, Biological psychology, 24, 197-207, 1987

[26] Neville, H.J., and Lawson, D., “Attention to central and peripheral visual space in a movement detection task: an event-related potential and behavioral study. I. Normal hearing adults”, Brain Research, 405, 253-267, 1987

[27] Czigler, I., and Csibra, G., “Event-related potentials in a visual discrimination task: Negative waves related to detection and attention”, Psychophysiology, 27 (6), 669-676, 1990

[28] Alho, K., Woods, D.L., Algazi, A., and Näätänen, R., “Intermodal selective attention. II. Effects of attentional load on processing of auditory and visual stimuli in central space”, Electroencephalography and Clinical Neurophysiology, 82, 356-368, 1992

[29] Crottaz-Herbette, S., “Attention spatiale auditive et visuelle chez des patients héminégligents et des sujets normaux : étude clinique, comportementale et électrophysiologique”, PhD Thesis, University of Geneva, Switzerland, 2001

[30] Desmedt, J.E., and Tomberg, C., “Mapping early somatosensory evoked potentials in selective attention: Critical evaluation of control conditions used for titrating by difference the cognitive P30, P40, P100 and N140”, Electroencephalography and Clinical Neurophysiology, 74, 321-346, 1989

[31] Downing, Paul, Jia Liu, and Nancy Kanwisher. “Testing cognitive models of visual attention with fMRI and MEG.” Neuropsychologia 39.12 (2001): 1329-1342.

[32] Cabeza, Roberto, and Lars Nyberg. “Imaging cognition II: An empirical review of 275 PET and fMRI studies.” Journal of cognitive neuroscience 12.1 (2000): 1-47.

Categories
Computational Attention Insights

What is attention? – Part 2: From neuroscience to computer science

Attention: the technology comes in

After the late-1980s “crisis” in attention research, two different communities appeared in the study of attention, with the arrival of tools providing new insights on brain behavior and with the increasing power of computers. One community deals with cognitive neuroscience and, together with cognitive psychology, intends to understand the deep mechanisms of attention, while the other community focuses on engineering and computer science, and its goal is to develop attention models to be applied in signal processing, especially image processing (Figure 1).

Fig. 1 Attention history: an accumulation of domains in onion layers

The arrival of new techniques and computational capacities brought fresh air (and results) in the study of attention.

Attention in cognitive neuroscience

Cognitive neuroscience arrived with a whole set of new tools and methods. While some of them were already used in cognitive psychology (EEG, eye-tracking devices…), others are new tools providing new insights on brain behavior:

  • Psychophysiological methods: scalp recordings of EEG (electroencephalography, which measures the electrical activity of neurons) and MEG (magnetoencephalography, which measures the magnetic activity of neurons), which are complementary in terms of sensitivity to different brain areas of interest.
  • Neuroimaging methods: functional MRI and PET scan images, which both measure the areas in the brain which have intense activity given a task that the subject executes (visual, audio …).
  • Electrophysiological methods: single-cell recordings, which measure the electrophysiological response of a single neuron using a microelectrode. While this method is much more precise, it is also more invasive.
  • Other methods: TMS (transcranial magnetic stimulation, which can be used to stimulate a region of the brain and to measure the activity of specific brain circuits in humans) and multi-electrode technology, which allows the study of the activity of many neurons simultaneously, showing how different neuron populations interact and collaborate.

Using those techniques, two main families of theories arose.

The first and most well-known model is the one by Desimone and Duncan on biased competition [1]. The central idea is that at any given moment there is more information in the environment than can be processed. Relevant information always competes with irrelevant information to influence behavior. Attention biases this competition, increasing the influence of behavior-relevant information and decreasing the influence of irrelevant information. Desimone explicitly suggests a physiologically plausible neural basis that mediates this competition in the visual system. A neuron's receptive field is a window to the outside world: the neuron reacts only to stimuli inside this window and is insensitive to stimulation elsewhere. The authors assume that competition between stimuli takes place when more than one stimulus shares the same receptive field. This approach is very interesting, as each neuron can be seen as a filter by itself, and receptive fields range from very small and precise (as in the visual cortex V1) to very large, covering entire objects (as in the IT brain area). This basic idea reconciles the different approaches to attention (location-based, feature-based, object-based, attentional bottleneck) in a very natural and elegant way (a toy formalization is sketched below). Moreover, a link is made with memory, based on the notion of attentional templates in working memory which enhance neuron responses depending on previously acquired data.

A second family of models was set up by LaBerge in the late 1990s [2]. It is a structural model based on neuropsychological findings and data from neuroimaging studies. LaBerge conjectures that at least three brain regions are concurrently involved in the control of attention: frontal areas, especially the prefrontal cortex; thalamic nuclei, especially the pulvinar; and posterior sites, namely the posterior parietal cortex and the intraparietal sulcus. LaBerge proposes that these regions are necessary for attention and that they presumably give rise to attentional control together.

While cognitive neuroscience brought a lot of new information to cognitive psychology, the attention process is still far from fully understood and a lot of work is ongoing in the field.
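A toy formalization of biased competition, in the spirit of later single-neuron implementations of the model (e.g., Reynolds, Chelazzi and Desimone, 1999; this reference and the formula are our addition, not taken from the text): the response of a neuron whose receptive field contains two stimuli is a weighted average of its responses to each stimulus alone, and attention biases the weights toward the behaviorally relevant stimulus.

```python
def biased_response(l1, l2, w1=1.0, w2=1.0):
    """l1, l2: firing rates to each stimulus presented alone;
    w1, w2: competition weights. Attending stimulus 1 raises w1,
    pulling the combined response toward l1."""
    return (w1 * l1 + w2 * l2) / (w1 + w2)

# Unattended pair: the response sits between the two single-stimulus rates.
# biased_response(40.0, 10.0)          -> 25.0
# Attention on stimulus 1 biases the competition in its favor.
# biased_response(40.0, 10.0, w1=4.0)  -> 34.0
```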

Attention in computer science

While cognitive neuroscience focuses on researching the very nature of attention, a different angle appeared in the 1980s with the development of computational power. Building on Treisman and Gelade's feature integration theory [3], C. Koch and S. Ullman [4] proposed that the different visual features contributing to the attentive selection of a stimulus (color, orientation, movement, etc.) are combined into one single topographic map, called the “saliency map”. It integrates the normalized information from the individual feature maps into one global measure. Bottom-up saliency is determined by how different a stimulus is from its surroundings, at several scales. The saliency map provides, for each region of the visual field, its probability of being attended. This saliency map concept is close to the “master map” postulated by Treisman and Gelade in their feature integration theory.

The first computational implementation of the Koch and Ullman architecture was achieved by Laurent Itti in his seminal work [5]. This very first computational attention system takes any image as input and outputs a saliency map of this image, together with a winner-take-all mechanism simulating the eye fixations during selective attention (a minimal sketch of the idea is given below). From that point, hundreds of models were developed, first for images, then for videos, and, very recently, some for audio or even 3D data.
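A minimal sketch of the idea, restricted to the intensity channel and assuming NumPy/SciPy (this is not Itti's actual multi-channel implementation): center-surround contrast is computed as the difference between fine and coarse Gaussian blurs at several scales, each map is normalized, and the normalized maps are combined into one saliency map whose maximum is the winner-take-all prediction of the first fixation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_saliency(image, center_sigmas=(1.0, 2.0), surround_scale=4.0):
    """image: 2-D grayscale array in [0, 1]; returns a saliency map."""
    feature_maps = []
    for sigma in center_sigmas:
        center = gaussian_filter(image, sigma)                     # fine scale
        surround = gaussian_filter(image, sigma * surround_scale)  # coarse scale
        fm = np.abs(center - surround)        # center-surround contrast
        if fm.max() > 0:
            fm = fm / fm.max()                # normalize before combination
        feature_maps.append(fm)
    return np.mean(feature_maps, axis=0)      # the combined "saliency map"

# Winner-take-all: the most salient location, i.e. the predicted first fixation.
# y, x = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
```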

After the initial biologically inspired models, a wave of models based on mathematics, statistics or information theory arrived on the “saliency market”, predicting human attention better and better. They are all based on features extracted from the signal (most of the time low-level features, but not always), such as luminance, color, orientation, texture, motion, the relative position of objects, or even simply neighborhoods or patches of the signal. Once those features are extracted, all the existing methods are essentially based on the same principle: looking for “contrasted, rare, surprising, novel, worthy-to-learn, less compressible, information-maximizing” areas. All those words are actually synonyms: they all amount to searching for unusual features in a given context. This context can be local (typically center-surround spatial or temporal contrasts), global (the whole image or a very long temporal history), or a model of normality (the image average, the image frequency content); a toy global-rarity example is sketched below. Very recently, learning has become more and more involved in saliency computation: at first it was mainly about adjusting model coefficients for a precise task; now complex classifiers like deep neural networks are beginning to be used both to extract features from the signal and to learn the most salient features from ground truth obtained with eye-tracking or mouse-tracking data.
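As a toy example of the global-rarity principle mentioned above (ours, not any specific published model), saliency can be taken as the self-information -log p(feature) of each pixel's quantized intensity, estimated from the global image histogram, so that rare intensities receive high saliency:

```python
import numpy as np

def rarity_saliency(image, bins=32):
    """image: 2-D grayscale array in [0, 1] -> per-pixel self-information."""
    quantized = np.clip((image * bins).astype(int), 0, bins - 1)
    counts = np.bincount(quantized.ravel(), minlength=bins)
    p = counts / counts.sum()        # global probability of each intensity bin
    self_info = -np.log2(p + 1e-12)  # rare features carry more information
    return self_info[quantized]      # look up each pixel's rarity
```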

So … what is attention?

The trans-domain nature of attention naturally led to a lot of different definitions. Attention deals with the allocation of cognitive resources to important incoming information in order to bring it to a conscious state, update a scene model, update the memory and influence behavior. But several attention mechanisms were highlighted, especially from Cherry's cocktail-party problem, where a dichotomy appeared between divided attention and selective attention. From there, a clinical model of attention divided into five different “kinds” appeared. One can also distinguish kinds of attention by whether or not they require eye focus, or whether they use only the image features or also memory and emotions… While its purpose seems to be the relation between the outer world and inner consciousness, memory and emotions, the clinical manifestation of attention tends to show that there might be several attentions.

Overt vs. covert: the eye

Overt versus covert attention is a property which was found at the very beginning of the psychological studies of attention. Overt attention is the one exhibited through eye activity or, more generally, through a focus of attention. Covert attention does not induce eye movements or a specific focus: it is the ability to catch (and thus be able to bring to consciousness) regions of an image which are not fixated by the eyes. Eye movements are due to the non-linear distribution of receptive cells (cones and rods) on the retina. The cones, which provide high resolution and color, are mainly concentrated in the middle of the retina, in a region called the “fovea”. This means that in order to acquire a good spatial resolution of a given area, the eye must gaze towards it so as to align it with the fovea. This constraint leads to mainly three types of eye movements:

  1. Fixations: the gaze stays for a minimal time period on approximately the same spatial area. The eye gaze is never completely still: even when gazing at a specific location, micro-saccades, which are very small movements of the eye during fixations, can be detected.
  2. Saccades: the eyes make a ballistic movement between two fixations. They disengage from one fixation and shift very rapidly to the next one. Between the two fixations, no visual data is acquired.
  3. Smooth pursuit: a smooth pursuit is a fixation … on a moving object. The eye follows a moving object to maintain it on the fovea (the central part of the retina). During smooth pursuit, small abrupt corrections occur when the object's image slips off the retina; this combination of smooth pursuit and small corrections is called a nystagmus.

Modelling overt attention amounts to predicting human fixations and the dynamical path of the eye between them (the eye “scanpath”). Working with recorded gaze data first requires separating fixations from saccades; a minimal classifier is sketched below.
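Here is a minimal velocity-threshold classifier separating fixation samples from saccade samples in a gaze trace, in the spirit of the common I-VT scheme (the threshold value and the sample format are illustrative assumptions):

```python
import numpy as np

def classify_gaze_samples(x_deg, y_deg, t_s, velocity_threshold=30.0):
    """x_deg, y_deg: gaze angles in degrees; t_s: timestamps in seconds.
    Returns a boolean array: True = saccade sample, False = fixation."""
    velocity = np.hypot(np.diff(x_deg), np.diff(y_deg)) / np.diff(t_s)
    is_saccade = velocity > velocity_threshold  # angular speed in deg/s
    # Give the first sample the same label as its successor so the output
    # has the same length as the input.
    return np.concatenate([is_saccade[:1], is_saccade])
```

Smooth pursuit is harder to separate with such a simple rule, since its velocities overlap with slow drifts; dedicated classifiers add a dispersion or direction-consistency criterion.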

Serial vs. parallel: the cognitive load

While focused, sustained and selective attention deal with the serial processing of information, alternating and divided attention deal with the parallel processing of several tasks. Attention can thus handle information both serially and in parallel. While there is a limit to the number of tasks which can be processed in parallel during divided attention (around 5 tasks), in the case of pre-attentive processing massively parallel computation can be done. Some notions, such as the gist [6], seem to be very fast and able to process (very roughly) the entire visual field to get a first idea about the context of the environment. The five kinds of attention follow a hierarchy based on the degree of focus, thus on the cognitive load needed to achieve the attentive task. This approach is sometimes called the clinical model of attention.

  1. Focused attention: respond to specific stimuli (focus on a precise task).
  2. Sustained attention: keep a consistent response during a long continuous activity (stay attentive over a long period of time while following the same topic).
  3. Selective attention: selectively maintain the cognitive resource on specific stimuli (focus only on a given object while ignoring distractors).
  4. Alternating attention: switch between multiple tasks (stop reading to watch something).
  5. Divided attention: deal simultaneously with multiple tasks (talking while driving).

Bottom-up vs. top-down: the memory and actions

Another fundamental property of attention needs to be taken into account: attention is a mix of two components, called bottom-up (or exogenous) and top-down (or endogenous). The bottom-up component is reflex-based and uses the acquired signal. Attention is attracted by the novelty of some features in a given context (local spatial: there is a contrasted region; global spatial: there is a red dot while all the others are blue; temporal: an object moves slowly while everything moved fast before…). Its main purpose is to alert in case of unexpected or rare situations, and it is tightly related to survival. This first component of attention is the best modeled in computer science, as signal features are objective cues which can easily be extracted computationally.

The second component of attention (top-down) deals with individual subjective factors. It is related to memory, emotions and individual goals. This component is harder to model computationally, as it is more subjective and needs cues about individual goals, a-priori knowledge or emotions. Top-down attention can itself be divided into two sub-components:

  1. Goal/Action-related: depending on an individual's current goal, certain features or locations are inhibited while others receive more weight. The same individual with the same prior knowledge responds differently to the same stimuli when the task at hand is different. This component is sometimes called “volitional”.
  2. Memory/Emotion-related: this process is related to experience and prior knowledge (and the emotions attached to them). In this category one finds scene context (experience from previously viewed scenes with similar spatial layouts or similar motion behavior) or object recognition (you spot your grandmother first in the middle of a crowd). This component of attention is more “automatic”: it does not need an important cognitive load and it can operate in addition to volitional attention. Conversely, volitional top-down attention cannot inhibit memory-related attention, which keeps working whether a goal is present or not. More generally, bottom-up attention cannot be inhibited when a strong and unusual signal is acquired. If someone searches for his keys (volitional top-down), he will pay no attention to a car passing by; but if he hears a strange sound (bottom-up) and then recognizes a lion (memory-related top-down attention), he will stop searching for the keys and run away… Volitional top-down attention is able to inhibit the other components of attention only when they are not very strong.

Attention vs. attentions: a summary

The study of attention is an accumulation of disciplines ranging from philosophy to computer science, passing through psychology and neuroscience. These disciplines sometimes study different aspects or views of attention, so giving a single, precise definition of attention is simply not feasible.

To sum up the different approaches, attention is about:

  • eye/neck mechanics and outside world information acquisition: the attentional “embodiment” leads to parallel and serial attention (overt versus covert attention)
  • allocation of cognitive resources to important incoming information: the attentional “filtering” is the first step towards data structuring (degree of focus and clinical model of attention)
  • mutual influence on memory and emotions: passing important information to a conscious state and getting feedback from memory and emotions (bottom-up and memory-related top-down attention)
  • behavior update: react to novel situations but also manage the goals and actions (bottom-up and volitional top-down attention)

Attention plays a crucial role from signal acquisition to action planning, going through the main cognitive steps… or maybe there are simply several attentions and not only one. At this point, this question still has no final answer.

References:

[1] Desimone, Robert, and John Duncan. “Neural mechanisms of selective visual attention.” Annual review of neuroscience 18.1 (1995): 193-222.

[2] LaBerge, David. “Networks of attention.” In: Gazzaniga, Michael S., ed. The Cognitive Neurosciences. MIT Press, 2004.

[3] Treisman, Anne M., and Garry Gelade. “A feature-integration theory of attention.” Cognitive psychology 12.1 (1980): 97-136.

[4] Koch, Christof, and Shimon Ullman. “Shifts in selective visual attention: towards the underlying neural circuitry.” Matters of intelligence. Springer Netherlands, 1987. 115-141.

[5] Itti, Laurent, Christof Koch, and Ernst Niebur. “A model of saliency-based visual attention for rapid scene analysis.” IEEE Transactions on pattern analysis and machine intelligence 20.11 (1998): 1254-1259.

[6] Torralba, Antonio, et al. “Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.” Psychological review 113.4 (2006): 766.

Categories
Computational Attention Insights

What is attention? – Part 1: From philosophy to psychology

A short history of attention:

Human attention is an obvious phenomenon, active during every single moment of awareness. It was studied first in philosophy, followed by experimental psychology, cognitive psychology, cognitive neuroscience and, finally, computer science for modeling. Those studies are not a serial experience: they add to one another like the layers of an “attention onion” (Figure 1).

Fig. 1 Attention history: an accumulation of domains in onion layers.

Due to the high diversity of the domains in which attention appears, a precise and general definition is not easy to find. Moreover, the views on attention evolved over time and across research domains. In this first part of our attempt to define attention, we go through a brief history of the related research, from philosophy to cognitive psychology. This first part addresses the period when the study of attention was more or less contained within a single community.

Conceptual findings: attention in philosophy

A first important study of human attention was that of N. Malebranche, a French Oratorian priest who was also a philosopher. In his “The Search After Truth”, published in 1675, Malebranche focused on the role of attention as a structuring system in scene understanding and thought organization.

In the 18th century, G. W. Leibniz introduced the concept of “apperception”, which refers to the assimilation of new and past experience into a new view of the world [1]. Leibniz's intuition concerns an involuntary approach to attention (known today as “bottom-up”) which is needed for a perceived event to become conscious.

In the 19th century, Sir W. Hamilton, a Scottish metaphysician, changed the previous view of attention, which held that humans can only focus on a single stimulus at once. Hamilton noted that when people throw marbles, they can remember the placement of only about seven of them [2]. This finding opened the way to the notion of “divided attention” and led, about one century later, to G. A. Miller's famous 1956 paper “The Magical Number Seven, Plus or Minus Two” [3].

Attention in experimental psychology

After the first philosophical approaches, attention entered a scientific phase when approached by psychology. Starting from an observation error detected in astronomy, W. Wundt introduced the study of consciousness and attention into psychology [4]. He interpreted this observation error as the time needed to voluntarily switch one's attention from one stimulus to another, and initiated a series of studies on mental processing speed, such as the ones carried out by F. Donders [5].

In the second half of the 19th century, H. von Helmholtz, in his “Treatise on Physiological Optics” [6], noted that despite the illusion that we see our whole environment at the same resolution, humans need to move their eyes around the visual field “because that is the only way we can see as distinctly as possible all the individual parts of the field in turn.” Even if he himself mainly studied the eye-movement scanpath (overt attention), he also discussed the existence of covert attention (which does not induce eye movements). Von Helmholtz focused on the role of attention in answering the question of “where” the objects of interest are.

In 1890, W. James published his textbook “The Principles of Psychology” [7] and remarked that attention is closely related to consciousness and structure. According to James, attention makes people perceive, conceive, distinguish and remember better, and shortens reaction times. James thereby linked attention to the notions of data compression and memory. Contrary to von Helmholtz, James focused more on attention answering the question of “what” the objects of interest are.

Attention in cognitive psychology

From the very beginning of the 20th century until 1949, the mainstream approach in psychology was behaviorism. During this period, the study of the mind was considered barely scientific and no important advances were achieved in the field of attention. Despite this “hole” in the study of attention, we can still find names such as J. R. Stroop, who worked on the “Stroop effect” [8], showing that conflicting stimuli (reading versus color naming) heavily impair people's performance.

After the Second World War, with its practical questions about soldiers' attention, and with the development of cognitivism, the study of attention made a tremendous comeback. Against the behaviorist view, which states that an organism's behavior is under environmental control, cognitivism showed that behavior can be modulated by attention.

The comeback of attention began with the work of C. Cherry in 1953 on the famous “cocktail party” paradigm [9]. This paradigm asks how people select the conversation they are listening to and ignore the rest. The problem was called “focused attention”, as opposed to “divided attention”.

D. Broadbent [10] summarized most of the findings known until then in a “bottleneck” model describing the selection properties of attention. The idea is that attention acts like a filter (selector) of relevant information based on basic features, such as color or orientation. If the incoming information matches the filter, it can reach awareness (a conscious state); otherwise it is discarded. At that time, the study of attention seemed to become very coherent, and this view was called “early selection”. Nevertheless, after this short positive period, most of the findings summarized by Broadbent proved to be conflicting.

The first “attack” came from the alternative model of Deutsch and Deutsch [11], who used some properties of the cocktail-party paradigm to introduce a “late selection” model, in which attentional selection is basically a matter of memory processing and response selection. The idea is that all information is acquired, but only the information that fits semantic or memory-related objects is selected to reach awareness. This view is opposed to Broadbent's, which professes an early selection of features before any further processing.

New models were introduced, such as the attenuated-filter model of A. Treisman [12], a softer version of Broadbent's bottleneck which lets stimuli with a response higher than a given threshold pass the filter and thus switch the focus of selective attention.

Later, in 1980, Treisman and Gelade [13] proposed the “feature integration” theory, in which attention occurs in two distinct steps: a preattentive, parallel, effortless step which extracts features from the objects in the visual field, and a second step in which those features are combined to obtain a hierarchy of attentional focus which pushes information towards awareness.

Despite its high importance within psychology, feature integration theory was also highly disputed. Other theories emerged, such as M. Posner's spotlight [14], supporting a spatial selection approach, or D. Kahneman's capacity theory [15], supporting the idea of mental effort.

In the late 1980s, a bunch of theories on attention flourished, yet none of them was capable of accounting for all previous findings. According to H. Pashler [16], cognitive psychology had reached a dead end: after several decades of research, more questions were raised than answers given. Pashler declared that “No one knows what attention is” as a provocative response to the famous “Everyone knows what attention is” proposed by James one century before.

The need for new approaches: after the late 1980s “crisis”

Attention deals with the allocation of cognitive resources to important incoming information in order to bring it to a conscious state, update the scene model and memory, and influence behavior. Between consciousness, memory and behavior, attention proved to be much more complex than initially expected, and some even question whether attention is one single concept or whether there are several “attentions”. Sometimes attention became a kind of magical box into which everything that could not be explained otherwise was put.

The number of issues and the complexity of the nature of attention led to an interesting split of the study of attention from one single community into two different communities.

The first aims to get further into the theoretical, profound nature of attention (cognitive neuroscience) using adapted, simple stimuli. The arrival of advanced tools such as functional imaging or single-cell recordings allowed it to make huge steps towards understanding attention.

The second community working in the attention field aims to make the concept work with real data such as images, videos or other signals (computer science). From the late 1990s and the first computational models of visual attention, these two approaches developed in parallel: one trying to gain more insight into the biological brain, the other trying to obtain results which can predict eye behavior on real-life stimuli. Even if the computational attention community produced some models very different from what is known to happen in the brain, the engineers' creativity is impressive, the results on real-life data are beginning to be significant, and the applications endless.

In the second part of our attempt to understand what attention is, we will focus on cognitive neuroscience on one side and on computational attention on the other, along with the known properties of attention.

References

[1] Runes, Dagobert D., ed. The dictionary of philosophy. Citadel Press, 2001.

[2] Hamilton, William. Lectures on metaphysics and logic. Vol. 1. Gould and Lincoln, 1859.

[3] Miller, George A. “The magical number seven, plus or minus two: some limits on our capacity for processing information.” Psychological review 63.2 (1956): 81.

[4] Wundt, Wilhelm Max. Principles of physiological psychology. Vol. 1. Sonnenschein, 1904.

[5] Goldstein, E. Cognitive psychology: Connecting mind, research and everyday experience. Cengage Learning, 2014.

[6] von Helmholtz, Hermann. Treatise on physiological optics. Vol. 3. Courier Corporation, 2005.

[7] James, William. “The principles of psychology, Vol II.” (1913).

[8] Jensen, Arthur R., and William D. Rohwer. “The Stroop color-word test: A review.” Acta psychologica 25 (1966): 36-93.

[9] Cherry, E. Colin. “Some experiments on the recognition of speech, with one and with two ears.” The Journal of the acoustical society of America 25.5 (1953): 975-979.

[10] Broadbent, Donald Eric. “A mechanical model for human attention and immediate memory.” Psychological review 64.3 (1957): 205.

[11] Deutsch, J. Anthony, and Diana Deutsch. “Attention: some theoretical considerations.” Psychological review 70.1 (1963): 80.

[12] Treisman, Anne M. “Selective attention in man.” British medical bulletin (1964).

[13] Treisman, Anne M., and Garry Gelade. “A feature-integration theory of attention.” Cognitive psychology 12.1 (1980): 97-136.

[14] Posner, Michael I. “Attention in cognitive neuroscience: an overview.” (1995).

[15] Friedenberg, Jay, and Gordon Silverman. Cognitive science: an introduction to the study of mind. Sage, 2011.

[16] Pashler, Harold E., and Stuart Sutherland. The psychology of attention. Vol. 15. Cambridge, MA: MIT press, 1998.

Categories
Computational Attention Insights

Why should computers be attentive?

Any animal [1] from the tiniest insect [2] to humans is perfectly able to “pay attention”. Attention is the first step of perception: it analyses the outer real world and turns it into an inner conscious representation. Even during some dreaming phases known as REM (Rapid Eye Movements), the eye activity proves that the attentional mechanism is at work. But this time it analyses a virtual world coming from the inner subconscious and turns it into an inner conscious representation. Attention seems to be not only the first step of perception, but also the gate to conscious awareness.

The attentional process probably activates with the first development of a complex sense (like audition), which comes with the first REM dreams beginning after the sixth month of foetal development [3]. This mechanism is one of the first cognitive processes to be set up, and factors like smoking, drugs, alcohol or even stress during pregnancy lead to later attention disorders and even higher chances of developing psychopathologies [4][5]. It is largely proven, mainly through eye-tracking traces which can be very different between patients and control groups [6][7], that the attentive process is highly affected in cognitive psychopathologies such as autism or schizophrenia. The attentive process is set up as early as the prenatal period, when it already operates during babies' dreams. Until death, it occurs in every single moment of the day when people are awake, but also during their dreams. This shows the importance of attention: it cannot be dissociated from perception and consciousness. Even when a person is sleeping without dreaming and the eyes are not moving, important stimuli can “wake up” that person. Attention is never turned off; it can only be lowered to a standby state (except in drug-induced states where consciousness is altered or eliminated, as in an artificial coma). It is thus safe to say that if there is conscious life in a body capable of acting on its environment, there is attention.

As the gate of conscious awareness at the interface between the inner and outer worlds, attention can be both conscious (attentive) and unconscious (pre-attentive), and it is key to survival. Attention is also a sign of limited computational capabilities. Vision, audition, touch, smell and taste all provide the brain with a huge amount of information: gigabits of raw sensory data flow every second into the brain, which cannot physically handle such an information rate. Attention provides the brain with the capacity to select the main information and to build priority tasks. While there are a lot of definitions and views of attention, the one core idea which justifies attention regardless of the discipline, methodology or intuition is “information reduction” [8].

Attention only began to be seriously studied in the 19th century, with the arrival of modern psychology. Some thoughts about attention-related concepts may be found in Descartes, but no rigorous and intensive scientific study was done before the beginnings of psychology. How did philosophers miss such a key concept from ancient times until almost now? Part of the answer is given by William James, the father of psychology, in his famous definition of attention: “Everybody knows what attention is”. Attention is so natural, so linked to life and partly unconscious, so obvious that … nobody really noticed it until recently.

However, little by little, a new transversal research field appeared around the concept of “attention”, gathering first psychologists, then neuroscientists and, since the end of the nineties, even engineers and computer scientists. While covering the whole body of research on attention would need a whole series of books, the topic is narrowed here to attention modelling, a crucial step towards wider artificial intelligence.

Indeed, this key process of attention is currently rarely used in computers. Like the brain, a computer is a processing unit. Like the brain, it has limited computational capabilities and memory. Like the brain, computers must analyse more and more data. But unlike the brain, they do not pay attention. While a classical computer will be more precise in quantifying the whole input data, an attentive computer will focus on the most “interesting” data, which has several advantages:

  • It will be faster and more efficient in terms of memory storage due to its ability to process only part of the input data.
  • It will be able to find regularities and irregularities in the input signal and thus be able to detect and react to unexpected or abnormal events.
  • It will be able to optimize data prediction by describing novel patterns and, depending on the information reduction result (how efficient the information reduction was), it will be capable of being curious, bored or annoyed. This curiosity, which constantly pushes towards the discovery of more and more complex patterns to better reduce information, is a first step towards creativity.

Just as in humans attention is the gate to awareness and consciousness, in computers attention can lead to novel emergent computational paradigms beyond classical pre-programmed machines. While the road towards self-modifying computers is still long, computational attention is developing exponentially, letting more and more applications benefit from it.

References

[1] Zentall, Thomas R. “Selective and divided attention in animals.” Behavioural Processes 69.1 (2005): 1-15.
[2] Hoy, Ronald R. “Startle, categorical response, and attention in acoustic behavior of insects.” Annual review of neuroscience 12.1 (1989): 355-375.
[3] Hopson, Janet L. “Fetal psychology.” Psychology Today 31.5 (1998): 44.
[4] Mick, Eric, et al. “Case-control study of attention-deficit hyperactivity disorder and maternal smoking, alcohol use, and drug use during pregnancy.” Journal of the American Academy of Child & Adolescent Psychiatry 41.4 (2002): 378-385.
[5] Linnet, Karen Markussen, et al. “Maternal lifestyle factors in pregnancy risk of attention deficit hyperactivity disorder and associated behaviors: review of the current evidence.” American Journal of Psychiatry 160.6 (2003): 1028-1040.
[6] Holzman, Philip S., et al. “Eye-tracking dysfunctions in schizophrenic patients and their relatives.” Archives of general psychiatry 31.2 (1974): 143-151.
[7] Klin, Ami, et al. “Visual fixation patterns during viewing of naturalistic social situations as predictors of social competence in individuals with autism.”Archives of general psychiatry 59.9 (2002): 809-816.
[8] Itti, Laurent, Geraint Rees, and John K. Tsotsos, eds. Neurobiology of attention. Academic Press, 2005.