Computational Attention Insights

Applications of Saliency Models – Part Three

Catch up on Parts One and Two.
Applications based on abnormality processing

The third category of attention-based applications concerns abnormality processing. These applications go beyond simply detecting areas of interest: they compare the salient areas found on the saliency maps. Application domains such as robotics or advertising benefit greatly from this category of applications.

Robotics is a very large domain of application with various needs. There are three research axes where robots can benefit from saliency models: 1) image registration and landmark extraction, 2) object recognition, and 3) robot action guidance.

An important need of a robot is to know where it is located. To this end, the robot can use the data from its sensors to find landmarks (salient feature extraction) and register images taken at different times (salient feature comparison) to build a model of the scene. The general process of building a view of the scene in real time is called Simultaneous Localization and Mapping (SLAM). Saliency models can greatly help extract more stable landmarks from images, which can then be compared more robustly [25]. These techniques first require the computation of saliency maps, but the results are not used directly: they need to be further processed (in particular, salient areas are compared).

Once the scene is established, a robot next needs to recognize the objects in it which might be worth interacting with. Two steps are needed to recognize objects. First, the robot needs to detect the object in the scene. Here saliency models can help a lot, as they can provide information about proto-objects [26] or the objectness of areas [27]. Once objects are detected, they need to be recognized. The main approach in this area is to 1) extract features (SIFT, SURF, or others) from the object, 2) filter the features based on a saliency map, and 3) perform the recognition with a classifier (such as an SVM). Papers like [28] or [29] apply this technique, which drastically decreases the number of keypoints needed to perform the object recognition. Another approach was used in [30] and [31]: the features which are mostly present in the searched object but absent from its surroundings are learned, and this learning phase provides a new set of weights for bottom-up attention models. In this way, the features which are the most discriminant for the searched object get the highest response in the final saliency map. A third approach can be found in [32], where the relative positions of salient points (maximal cliques) are used for image recognition.
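The saliency-based keypoint filtering of step 2 can be sketched as follows. This is a minimal illustration (the function name and keypoint format are hypothetical), assuming keypoints given as (x, y) pixel coordinates and a precomputed saliency map normalised to [0, 1]:

```python
import numpy as np

def filter_keypoints_by_saliency(keypoints, saliency_map, keep_ratio=0.2):
    """Keep only the keypoints lying on the most salient pixels.

    keypoints    : (N, 2) array of (x, y) pixel coordinates
    saliency_map : 2-D array normalised to [0, 1]
    keep_ratio   : keep points whose saliency is within the top
                   `keep_ratio` fraction of the map's maximum
    """
    threshold = saliency_map.max() * (1.0 - keep_ratio)
    xs = keypoints[:, 0].astype(int)
    ys = keypoints[:, 1].astype(int)
    scores = saliency_map[ys, xs]            # saliency under each keypoint
    return keypoints[scores >= threshold]
```

The classifier in step 3 then only sees descriptors computed at the surviving keypoints, which is where the reported reduction in computation comes from.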

Once robots know where they are (attentive visual SLAM) and recognize the objects around them (attentive object recognition), they need to decide what to do next. One of these decisions is where to look next, and it is naturally taken based on visual attention. Several robots, like the iCub, implement multi-modal attention: they combine visual and audio saliency in an egosphere, which is used to direct the gaze to the next location. An interesting survey on attention for interactive robots can be found in [33].

Another domain also belongs to this abnormal-region-processing category of applications: visual communication optimization. Marketing optimization can be applied to a large number of practical cases, such as websites, advertisements, product placement in supermarkets, signage, and 2D and 3D object placement in galleries.

Among the different applications of automatic saliency computation, marketing and communication optimization is probably one of the closest to market. As it is possible to predict an image's attention map, that is, a map of the probability that people attend to each pixel of the image, it is possible to predict where people are likely to look at marketing material such as an advertisement or a website. Attracting customer attention is the first step in a process that raises people's interest, induces desire and need for the product, and finally pushes the client to buy it.

Feng-GUI [34] is an Israeli company mainly focusing on web page and advertising optimization, although the algorithm is also capable of analyzing video sequences. AttentionWizard [35] is a US company mainly focusing on web pages. There are a few hints about the algorithm used: it relies on bottom-up features such as color differences, contrast, density, brightness and intensity, edges and intersections, length and width, and curve and line orientations. Top-down features include face detection, skin color, and text detection (especially big text). 3M VAS [36] is the only big international player in this field. Very few details are given on the algorithm used, but it can also compute video saliency. They provide attention maps for website optimization, but also for advertisements with static images or videos, packaging, and in-store merchandising. EyeQuant [37] is a German company specialized in website optimization. Their algorithm is trained on extensive eye-tracking tests to bring it closer to real eye-tracking results for a given task. All these companies claim around 90% accuracy for the first 3 to 5 seconds of viewing [38]. They base this claim on comparisons between their algorithms and several existing databases using several ROC metrics, always comparing the results with the maximum ROC score obtained by human observers. Nevertheless, for real-life images and for given tasks and emotion-based communication, this accuracy drops dramatically but remains usable.

With more and more 3D objects being created, manipulated, sold, or even printed, 3D saliency is a very promising research direction. The main idea is to compute a saliency score for each view of a 3D model: the best viewpoint is the one where the total object saliency is maximized [39]. Mesh saliency was introduced by adapting 2D saliency concepts to the mesh structure [40]. Viewpoint selection and mesh simplification are also related through the use of mesh saliency [41]. While the best-viewpoint application can be used for computer graphics or even 3D mesh compression, marketing is one of the targets of this research topic: more and more 3D objects are displayed, even on the internet, and the question of how to display them optimally is very interesting for marketing.
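In its simplest form, the best-viewpoint idea of [39] reduces to an argmax over per-view saliency totals. A minimal sketch (the function name is hypothetical, and the per-view saliency maps are assumed to have been precomputed by rendering the model from each candidate viewpoint):

```python
import numpy as np

def best_viewpoint(view_saliency_maps):
    """Return the index of the candidate view whose rendered saliency map
    has the highest total saliency (the criterion used in [39])."""
    totals = [float(np.sum(view)) for view in view_saliency_maps]
    return int(np.argmax(totals))
```

The real difficulty hides in producing the per-view maps (mesh saliency, visibility, occlusions); the selection itself is this one-liner.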


During the last two decades, significant progress has been made in the area of visual attention.
Regarding applications, a three-category taxonomy is proposed here:

  • Abnormality detection: uses the detection of the most salient areas.
  • Normality detection: uses the detection of the least salient areas.
  • Abnormality processing: compares and further processes the most salient areas.

This taxonomy lets us simplify and classify a very long list of applications which can benefit from attention models. We are just at the early stages of the use of saliency maps in computer vision applications. Nevertheless, the number of already existing applications shows a promising avenue for saliency models, both in improving existing applications and in creating new ones. Indeed, several factors are nowadays moving saliency computation from the lab to industry:

  • Model accuracy has drastically increased over two decades, both for bottom-up saliency and for top-down information and learning.
  • The models working on both videos and images are more and more numerous and provide more and more realistic results. New models including audio signals and 3D data are being released and are expected to provide convincing results in the near future.
  • The combined enhancement of computing hardware and algorithm optimization has led to good-quality saliency computation in real time or near real time.

25. Frintrop, S. and Jensfelt, P. (2008) Attentional landmarks and active gaze control for visual slam. Robotics, IEEE Transactions on, 24 (5), 1054–1065.
26. Walther, D. and Koch, C. (2006) Modeling attention to salient proto-objects. Neural networks, 19 (9), 1395–1407.
27. Alexe, B., Deselaers, T., and Ferrari, V. (2010) What is an object?, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, pp. 73–80.
28. Zdziarski, Z. and Dahyot, R. (2012) Feature selection using visual saliency for content-based image retrieval, in Signals and Systems Conference (ISSC 2012), IET Irish, IET, pp. 1–6.
29. Awad, D., Courboulay, V., and Revel, A. (2012) Saliency filtering of sift detectors: application to cbir, in Advanced Concepts for Intelligent Vision Systems, Springer, pp. 290–300.
30. Navalpakkam, V. and Itti, L. (2006) An integrated model of top-down and bottom-up attention for optimizing detection speed, in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, IEEE, pp. 2049–2056.
31. Frintrop, S., Backer, G., and Rome, E. (2005) Goal-directed search with a top-down modulated computational attention system, in Pattern Recognition, Springer, pp. 117–124.
32. Stentiford, F. and Bamidele, A. (2010) Image recognition using maximal cliques of interest points, in Image Processing (ICIP), 2010 17th IEEE International Conference on, IEEE, pp. 1121–1124.
33. Ferreira, J.F. and Dias, J. (2014) Attentional mechanisms for socially interactive robots–a survey. Autonomous Mental Development, IEEE Transactions on, 6 (2), 110–125.
34. Feng-GUI website, proposing automatic saliency maps for marketing material.
35. AttentionWizard website, proposing automatic saliency maps for marketing material.
36. 3M VAS website, proposing automatic saliency maps for marketing material.
37. EyeQuant website, proposing automatic saliency maps for marketing material.
38. Page containing the 3M VAS studies showing algorithm accuracy in general and in a marketing framework.
39. Takahashi, S., Fujishiro, I., Takeshima, Y., and Nishita, T. (2005) A feature-driven approach to locating optimal viewpoints for volume visualization, in Visualization, 2005. VIS 05. IEEE, IEEE, pp. 495–502.
40. Lee, C.H., Varshney, A., and Jacobs, D.W. (2005) Mesh saliency, in ACM Transactions on Graphics (TOG), vol. 24, ACM, pp. 659–666.
41. Castelló, P., Chover, M., Sbert, M., and Feixas, M. (2014) Reducing complexity in polygonal meshes with view-based saliency. Computer Aided Geometric Design, 31 (6), 279–293.


Attention in computer science – Part 2

In the previous part we mainly dealt with visibility models and static saliency models of attention. But computational attention could not remain focused only on static images, and it has developed toward other modalities.

  1. Video saliency

Some still-image models were simply extended to video. For example, Seo & Milanfar (2009) introduced the time dimension by replacing square spatial patches with 3D spatio-temporal cubic patches whose third dimension is time. Itti's model was also generalized by adding motion and flicker features to the initial spatial feature set of luminance, color, and orientations. These models mainly show that strong motion is salient. A question might be: what can saliency models bring beyond a good motion detector? Models like Mancas et al. (2011) developed a bottom-up saliency map to detect abnormal motion, exhibiting promising results from a few moving objects to dense crowds, with increasing performance. The idea is that motion is salient most of the time, but within the moving regions some areas are more interesting than others.
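A crude sketch of this abnormal-motion idea (not the actual model of Mancas et al.; the function name and the histogram-based rarity measure are simplifying assumptions) scores each pixel by the self-information of its frame-difference magnitude, so that uncommon amounts of motion stand out rather than motion per se:

```python
import numpy as np

def motion_rarity_map(prev_frame, frame, bins=16):
    """Saliency as the self-information (rarity) of each pixel's motion
    magnitude: amounts of motion that are rare in this frame pair score
    high, whether the scene is almost static or densely crowded."""
    motion = np.abs(frame.astype(float) - prev_frame.astype(float))
    hist, edges = np.histogram(motion, bins=bins,
                               range=(0.0, motion.max() + 1e-6))
    p = hist / hist.sum()                        # probability of each bin
    idx = np.clip(np.digitize(motion, edges[1:-1]), 0, bins - 1)
    surprise = -np.log2(p[idx] + 1e-12)          # rare motion -> high surprise
    return surprise / surprise.max()
```

With dense optical flow instead of a plain frame difference, the same rarity measure can be applied to speed and direction separately.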

  2. 3D saliency

3D saliency modeling is an emerging area of research, which was boosted by two facts. The first is the arrival of affordable RGB-D cameras, which provide both classical RGB images and a depth map describing each pixel's distance from the camera. This depth information is very important, as it provides new features (curvature, compactness, convexity, …). The second is the higher availability of 3D models (used, for example, in 3D printing): 3D models are more easily available, and libraries like PCL (Aldoma et al. (2012)) can handle 3D point clouds, convert formats, and compute features from those point clouds. As for video, most 3D saliency models are extensions of still-image models. Some use 3D meshes based on Itti's approach, others just add depth as an additional feature, while recent models are based on point clouds. As 3D saliency models are mainly extensions of 2D models, the different features can be taken into account locally and/or globally on the 3D objects, depending on the extended model.

  3. Audio saliency

There are very few auditory attention models compared to visual attention models. One approach deals with the local context of audio signals. Kayser et al. (2005) compute auditory saliency maps based on Itti's visual model (1998). First, the sound wave is converted to a time-frequency representation. Then three auditory features are extracted at different scales and in parallel (intensity, frequency contrast, and temporal contrast). For each feature, the maps obtained at different scales are compared using a center-surround mechanism and normalized. Finally, a linear combination builds the saliency map, which is then reduced to one dimension to fit the one-dimensional audio signal. Another approach to computing auditory saliency maps follows the well-established Bayesian surprise approach from computer vision (Itti & Baldi (2006)). An auditory surprise is introduced to detect acoustically salient events. First, a Short-Time Fourier Transform (STFT) is used to calculate the spectrogram. The surprise is then computed in the Bayesian framework.
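The STFT-plus-surprise pipeline can be sketched as follows. This is a deliberate simplification, not the full Bayesian formulation: the spectrogram helper and the running-mean "expectation" against which each new frame is compared are assumptions made for illustration:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude Short-Time Fourier Transform (frames x frequency bins)."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def auditory_surprise(spec, eps=1e-6):
    """Per-frame surprise: deviation of each spectrum from a slowly
    updated running mean of past frames, summed over frequency (a crude
    stand-in for the Bayesian-surprise computation)."""
    surprise = np.zeros(len(spec))
    mean = spec[0].copy()
    for t in range(1, len(spec)):
        surprise[t] = np.sum((spec[t] - mean) ** 2) / (np.sum(mean ** 2) + eps)
        mean = 0.9 * mean + 0.1 * spec[t]    # expectation adapts over time
    return surprise
```

A steady tone quickly becomes "expected" and scores low, while a sudden burst at a new frequency produces a surprise peak.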

  4. Top-down saliency

Top-down attention is endogenous: the information comes from the inner world (information from memory, its related emotional level, and also task-related information). In practice, two main families of top-down information can be added to bottom-up attention models.

4.1       What is normal?

The first family mainly deals with learned normality, which can come from experience with the current signal if it is time-varying, or from previous experience (tests, databases). Concerning still images, the "normal" gaze behavior can be learned from the "mean observer". Eye-tracking (Judd et al. (2009)) or mouse-tracking (Mancas (2007)) techniques can be used on several users, and the average of their gaze over a set of natural images can be computed. The results show that, for natural images, the eye gaze is attracted by the center of the images. This observation for natural images is very different from more specific images which call on a priori knowledge: Mancas (2009) showed using mouse tracking that gaze density is very different on a set of advertisements and on a set of websites. This is partly due to the a priori knowledge that people have about those images (areas containing the title, logo, or menu). For video signals, it is also possible to accumulate motion patterns over time for each extracted feature to obtain a model of normality. After a given period of observation, the model can detect that, at a given location, moving objects are generally fast and going from left to right. If an object at the same location is slow and/or going from right to left, this is surprising given what was previously learned from the scene, so attention will be directed to this object. For 3D signals, another cue is the proximity of objects: a close object is more likely to attract attention, as it is more likely to be the first one we will have to interact with.
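The video-normality idea above can be sketched with per-location direction histograms. The class below is a hypothetical minimal illustration (real models accumulate richer features such as speed and object size); each grid cell learns which motion directions are usual there, and deviations score high:

```python
import numpy as np

class MotionNormalityModel:
    """Accumulates, for each grid cell, a histogram of observed motion
    directions; a new observation is surprising when its direction was
    rarely seen at that location before."""
    def __init__(self, grid_shape, n_dirs=8):
        # Laplace smoothing: start every direction with one pseudo-count
        self.counts = np.ones(grid_shape + (n_dirs,))
        self.n_dirs = n_dirs

    def _bin(self, angle):
        return int((angle % (2 * np.pi)) / (2 * np.pi) * self.n_dirs) % self.n_dirs

    def observe(self, cell, angle):
        self.counts[cell][self._bin(angle)] += 1

    def surprise(self, cell, angle):
        p = self.counts[cell] / self.counts[cell].sum()
        return -np.log2(p[self._bin(angle)])    # self-information in bits
```

After many left-to-right observations at a cell, a right-to-left mover at the same cell yields a much higher surprise value.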

4.2       Where are my keys?

While the previous section dealt with attention attracted by events which are not consistent with the knowledge acquired about the scene, here we focus on a second main top-down cue: a visual task ("Find the keys!"). This task also has a huge influence on the way the image is attended, and it implies object recognition ("recognize the keys") and knowledge of objects' usual locations ("they could be on the floor, but never on the ceiling").

Object recognition can be achieved through classical methods or using points of interest (like SIFT or SURF, Bay et al. (2008)). Some authors integrated the notion of object recognition into the architecture of their model, like Navalpakkam & Itti (2005): they extract the same features as the bottom-up model from the object and learn them. This learning step provides weight modifications for the fusion of the conspicuity maps, which leads to the detection of the areas containing the same feature combination as the learned object. Another way of adding top-down information is to give a higher weight to the image areas which have a higher probability of containing the searched object. Several authors, such as Oliva et al. (2003), developed methods to learn objects' locations.
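The learned-weight fusion can be sketched as follows. This is a hypothetical illustration in the spirit of Navalpakkam & Itti (not their actual formulation), assuming per-feature conspicuity maps plus scalar estimates of each feature's strength in the target and in the scene background:

```python
import numpy as np

def top_down_fusion(conspicuity_maps, target_features, scene_features):
    """Weight each feature's conspicuity map by how much more present the
    feature is in the target than in the rest of the scene, then fuse."""
    weights = np.array(target_features) / (np.array(scene_features) + 1e-6)
    weights = weights / weights.sum()            # normalise the weights
    return sum(w * m for w, m in zip(weights, conspicuity_maps))
```

Features that discriminate the target from its background dominate the fused map, so the searched object's locations respond most strongly.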

  5. Learning bottom-up and top-down together

Recently, learning the salient features has become more and more popular: the idea is not to find the rare regions, but to find an optimal description of those rare regions, which are already known from eye-tracking or mouse-tracking ground truth. The learning is based on deep neural networks, sparse coding, and pooling over large image datasets where the regions of interest are known. The most attended regions according to the eye-tracking results are used to train classifiers which extract the main features of these areas. The use of deep neural networks has greatly improved these techniques, which are now able to extract meaningful mid- and high-level features that best describe the salient regions (Shen & Zhao (2014)). Indeed, this learning step finds the classical bottom-up features in the first layers, but it also adds context, a centered Gaussian, and object detection (faces, text) and recognition, all together. An issue with these methods is a loss of generality: the models work for given datasets, even if deep learning is able to cope with high variability in the case of general images.

  6. Attention in computer science

In computer science there are two families of models: some are based on feature visibility and others on the concept of saliency maps, the latter approach being the most prolific. For saliency-based bottom-up attention the idea is the same in all models: find the areas of the image which are the most surprising in a given context (local, global, or normality-based). Finally, a set of top-down cues which can influence the saliency-based models was reviewed. Recently, deep neural networks have been used to integrate both bottom-up and top-down information at the same time.

  7. References

Aldoma, A., Marton, Z.-C., Tombari, F., Wohlkinger, W., Potthast, C., Zeisl, B., Rusu, R. B., Gedikli, S. & Vincze, M. (2012). Point cloud library, IEEE Robotics & Automation Magazine 19(3).

Bay, H., Ess, A., Tuytelaars, T. & Gool, L. V. (2008). Surf: Speeded up robust features, Computer Vision and Image Understanding (CVIU) 110(3): 346–359.

Itti, L. & Baldi, P. F. (2006). Modeling what attracts human gaze over dynamic natural scenes, in L. Harris & M. Jenkin (eds), Computational Vision in Neural and Machine Systems, Cambridge University Press, Cambridge, MA.

Judd, T., Ehinger, K., Durand, F. & Torralba, A. (2009). Learning to predict where humans look, IEEE Inter. Conf. on Computer Vision (ICCV), pp. 2376–2383.

Kayser, C., Petkov, C., Lippert, M. & Logothetis, N. K. (2005). Mechanisms for allocating auditory attention: An auditory saliency map, Curr. Biol. 15: 1943–1947.

Mancas, M. (2007). Computational Attention Towards Attentive Computers, Presses universitaires de Louvain.

Mancas, M. (2009). Relative influence of bottom-up and top-down attention, Attention in Cognitive Systems, Vol. 5395 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg.

Mancas, M., Riche, N., Leroy, J. & Gosselin, B. (2011). Abnormal motion selection in crowds using bottom-up saliency, IEEE ICIP.

Navalpakkam, V. & Itti, L. (2005). Modeling the influence of task on attention, Vision Research 45(2): 205–231.

Oliva, A., Torralba, A., Castelhano, M. & Henderson, J. (2003). Top-down control of visual attention in object detection, Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, Vol. 1, pp. I-253–I-256.

Seo, H. J. & Milanfar, P. (2009). Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9(12).

Shen, C. & Zhao, Q. (2014). Learning to predict eye fixations for semantic contents using multi-layer sparse network, Neurocomputing 138: 61–68.


Attention in computer science – Part 1

Numediart Institute, Faculty of Engineering (FPMs), University of Mons (UMONS) Matei Mancas, 31 Bd. Dolez, 7000 Mons, Belgium

Idea and approaches. As we already saw, attention is a topic which was first taken into account by philosophy; it was then discussed by cognitive psychology and neuroscience, and only in the late nineties did attention modeling arrive in the domain of computer science and engineering. In this domain, two main approaches can be found: the first is based on the notion of "saliency", the second on the idea of "visibility". In practice, the models based on saliency are by far more widespread in computer science than the visibility models. The notion of "saliency" implies a competition between "bottom-up" (exogenous) and "top-down" (endogenous) information. The idea of bottom-up saliency maps is that people's gaze is directed to areas which, in some way, stand out from the background based on novel or rare features. This bottom-up saliency can be modulated by top-down information based on memory, emotions, or goals. The eye movements (scan paths) can be computed from the saliency map, which remains the same during eye motion: it is a global static attention (saliency) map which only provides, for each pixel, a probability of attracting human gaze.

Visibility models. These models of human attention assume that people attend locations that maximize the information acquired by the eye (the visibility) to solve a given task (which can also simply be free viewing). In this case, top-down information is naturally included in the notion of task, along with dynamic bottom-up information maximization. In this approach, the eye movements are directly an output of the model and do not have to be inferred from a "saliency map", which is considered as a surface giving the posterior probability (following each fixation) that the target is at each scene location (Geisler & Cormack (2011)). Compared to other Bayesian frameworks, like that of Oliva et al. (2003), visibility models have one main difference: the saliency map is dynamic. Indeed, visibility models make explicit the resolution variability of the retina (Figure 1): an attention map is "re-computed" at each new fixation, as feature visibility changes with each fixation. Tatler (2007) introduces a tendency of the eye gaze to stay in the middle of the scene to maximize visibility over the image (which recalls the centered preference for natural images, also called the centered Gaussian bias).


Figure 1: Depending on the eye fixation position, visibility, and thus feature extraction, is different. Adapted from images by Jeff Perry.

Visibility models are mostly used in the case of strong tasks (for instance, Legge et al. (2002) proposed a visibility model capable of predicting eye fixations during the task of reading), and few of them are applied to free viewing, which is considered a weak task (Geisler & Cormack (2011)).

Saliency approaches: bottom-up methods. While visibility models are more used in cognitive science and with strong tasks, in computer science bottom-up approaches use features extracted only once from the signal, independently of the eye fixations, mainly for free viewing. Features are extracted from the image, such as luminance, color, orientation, texture, objects' relative positions, or even simply neighborhoods or patches. Once those features are extracted, all the existing methods are essentially based on the same principle: looking for contrasted, rare, surprising, novel, worth-learning, less compressible, or information-maximizing areas. All these definitions are actually synonyms, and they all amount to searching for unusual features in a given spatial context. In the following, we provide examples of contexts used for still images to obtain a saliency map. This saliency map can be visualized as a heatmap where hot colors represent pixels with a higher probability of attracting human gaze (Figure 2).


Figure 2: Left: initial image. Right: superimposed saliency heatmap on the initial image. The saliency map is static and gives an overview of where the eye is likely to attend.

Saliency methods for still images. The literature on still-image saliency models is very active. These models have various implementations and technical approaches, even if they all initially derive from the same idea. The purpose here is not to review all those models; instead, we propose a taxonomy to classify them, structured around the context that the methods take into account to exhibit image novelty. In this framework, there are three classes of methods.

The first class focuses on a pixel's surroundings: a pixel, a group of pixels, or a patch is compared with its surroundings at one or several scales. The main idea is to compute visual features at several scales in parallel, apply center-surround inhibition, combine the results into conspicuity maps (one per feature), and finally fuse them into a single saliency map. Many models derive from this approach, which mainly uses local center-surround contrast as a local measure of novelty. A good example of this family of approaches is Itti's model (Itti et al. (1998)), the first implementation of the Koch and Ullman model. This implementation proved to be the first successful approach to attention computation, providing better predictions of human gaze than chance or simple descriptors like entropy.
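A minimal center-surround sketch on a single intensity feature (a drastic simplification of Itti's multi-feature, multi-scale pipeline; the box-blur approximation of Gaussian smoothing and the scale pairs are illustrative assumptions):

```python
import numpy as np

def box_blur(img, k):
    """Separable box blur with radius k (window 2k + 1), edge-padded."""
    out = np.asarray(img, dtype=float)
    for axis in (0, 1):
        x = np.moveaxis(out, axis, 0)
        p = np.pad(x, ((k, k), (0, 0)), mode='edge')
        s = np.vstack([np.zeros((1, p.shape[1])), np.cumsum(p, axis=0)])
        x = (s[2 * k + 1:] - s[:-(2 * k + 1)]) / (2 * k + 1)
        out = np.moveaxis(x, 0, axis)
    return out

def center_surround_saliency(img, scales=((1, 4), (2, 8))):
    """Center-surround contrast: for each (center, surround) radius pair,
    take |fine blur - coarse blur| and sum across scales."""
    sal = sum(np.abs(box_blur(img, c) - box_blur(img, s)) for c, s in scales)
    return sal / sal.max()
```

Regions whose local statistics differ from their surround at any scale light up; uniform regions cancel out.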

A second class of methods uses the entire image as context, comparing pixels or patches with pixels or patches at other locations in the image, not necessarily in the surroundings of the initial patch. The idea can be divided into two steps. First, local features are computed in parallel from a given image. Second, the likeness of a pixel or neighborhood of pixels to other pixels or neighborhoods within the image is measured. A good example can be found in Seo & Milanfar (2009), which first proposes local regression kernels as features and then applies a nonparametric kernel density estimation to them, resulting in a saliency map of local "self-resemblance". Mancas (2009) and Riche et al. (2013) focus on the entire image: these models are designed to detect saliency in areas which are globally rare and locally contrasted. Boiman & Irani (2007) look for similar patches and the relative positions of these patches in an image.
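A toy version of the global-rarity idea (in the spirit of, but much simpler than, the cited models; the single intensity feature and histogram size are assumptions) scores each pixel by the self-information of its intensity over the whole image:

```python
import numpy as np

def global_rarity_saliency(img, bins=32):
    """Global rarity: a pixel is salient when its intensity is rare over
    the whole image (self-information of its histogram bin)."""
    img = np.asarray(img, dtype=float)
    hist, edges = np.histogram(img, bins=bins)
    p = hist / hist.sum()                        # global intensity statistics
    idx = np.clip(np.digitize(img, edges[1:-1]), 0, bins - 1)
    sal = -np.log2(p[idx] + 1e-12)               # rare intensity -> salient
    return sal / sal.max()
```

Real models of this class apply the same idea to richer features (colors, orientations, patches) and combine global rarity with local contrast.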

Finally, the third class of methods takes into account a context based on a model of what normality should be: if things are not as they should be, this can be surprising and thus attract people's attention. Achanta et al. (2009) proposed a very simple attention model: a distance is computed between a smoothed version of the input image and the average color vector of the input image. The average image is used as a kind of model of the image statistics: pixels which are far from those statistics are more salient. This model is mainly useful for salient object detection. Another approach to "normality" can be found in Hou & Zhang (2007), where the authors proposed a spectral model that is independent of any features. The difference between the log-spectrum of the image and its smoothed log-spectrum (the spectral residual) is reconstructed into a saliency map. Indeed, a smoothed version of the log-spectrum is closer to a 1/f decreasing log-spectrum template of normality, as small variations are removed. This approach is almost as simple as Achanta et al. (2009) but more efficient at predicting eye fixations.
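The spectral residual is compact enough to sketch in full; the smoothing-filter width and the wrap-around padding are the main simplifying assumptions here:

```python
import numpy as np

def _mean_filter(x, k):
    """2-D moving average of width 2k + 1 with periodic (wrap) padding."""
    p = np.pad(x, k, mode='wrap')
    out = np.zeros_like(x)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out += p[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (2 * k + 1) ** 2

def spectral_residual_saliency(img, k=1):
    """Spectral residual (Hou & Zhang, 2007): subtract the smoothed
    log-amplitude spectrum from the raw one, then invert the FFT with
    the original phase and smooth the squared result."""
    spectrum = np.fft.fft2(np.asarray(img, dtype=float))
    log_amp = np.log(np.abs(spectrum) + 1e-9)
    residual = log_amp - _mean_filter(log_amp, k)
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * np.angle(spectrum)))) ** 2
    sal = _mean_filter(sal, k)                   # final smoothing
    return sal / sal.max()
```

Whatever is "smooth" in the log-spectrum is treated as the expected 1/f background and removed, so only spectrally unusual structure survives the inverse transform.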

Towards video, audio or 3D signals and top-down attention. In the next parts we will focus on other kinds of signals, such as moving images (video), audio, or even 3D signals. In addition, even if top-down information is less often modeled in saliency approaches, there is nevertheless an important literature on the topic, which will also be detailed in the next parts.

Achanta, R., Hemami, S., Estrada, F. & Susstrunk, S. (2009). Frequency-tuned Salient Region Detection, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).
Boiman, O. & Irani, M. (2007). Detecting irregularities in images and in video, International Journal of Computer Vision 74(1): 17–31.
Geisler, W. S. & Cormack, L. (2011). Chapter 24: Models of Overt Attention, in The Oxford handbook of eye movements, Oxford University Press.
Hou, X. & Zhang, L. (2007). Saliency detection: A spectral residual approach, Proc. IEEE Conf. Computer Vision and Pattern Recognition CVPR ’07, pp. 1–8.
Itti, L., Koch, C. & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11): 1254 –1259.
Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an ideal-observer model of reading, Vision Research pp. 2219–2234.
Mancas, M. (2009). Relative influence of bottom-up and top-down attention, Attention in Cognitive Systems, Vol. 5395 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg.
Oliva, A., Torralba, A., Castelhano, M. & Henderson, J. (2003). Top-down control of visual attention in object detection, Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, Vol. 1, pp. I – 253–6 vol.1.
Riche, N., Mancas, M., Duvinage, M., Mibulumukini, M., Gosselin, B. & Dutoit, T. (2013). Rare2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis, Signal Processing: Image Communication 28(6): 642–658.
Seo, H. J. & Milanfar, P. (2009). Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9(12).
Tatler, B. (2007). The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions, Journal of Vision 7.


How to measure attention?

There are many ways to measure attention. Some, mainly in psychology, are more qualitative and use questionnaires and their interpretation. Some are quantitative but focus on the participants' feedback (button press, click, etc.) when they see/hear/sense a stimulus.

Here we focus on quantitative techniques which provide fine-grained information about the attentive responses. The attentive response can be measured either directly in the brain, or indirectly through the participants' eye behavior. Only one of the techniques described here is based on active participant feedback: mouse tracking. This is because mouse-tracking feedback is very close to that of eye-tracking, and it is an emerging approach of interest for the future: it needs less time and less money, and it provides more data than classical eye-tracking.

Eye-tracking: an indirect cue about overt attention

The use of an eye-tracker is probably the most widespread way to measure attention. The idea is to use a device which is able to precisely measure the eye gaze, which obviously only provides information concerning overt attention.

Eye-tracking technology has evolved considerably over time; the different technologies are described in [1]. One of the first techniques is EOG (Electro-OculoGraphy). The idea is to measure the skin electric potential around the eye, which gives the eye direction relative to the head. A consequence is that, for a complete eye-tracking system, the head must either be attached to a still system or a head tracker must be used in addition to the EOG. To get more precise results, special contact lenses can be used instead of EOG, but this technique is more invasive and it still only provides the eye direction relative to the head, not the gaze as an intersection with a screen, for example.

The technique used by most current commercial and research solutions is based on the video detection of the pupil/corneal reflection. An infra-red source sends light towards the eyes; the light is reflected by the eye, and the position of the corneal reflection relative to the pupil is used to compute the gaze direction.
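To give an intuition of this computation, many systems map the pupil–glint offset measured in the camera image to screen coordinates through a polynomial fitted during calibration. The sketch below illustrates this idea on invented calibration data (the function names, the second-order polynomial and the nine-point grid are illustrative assumptions, not a specific vendor’s method):

```python
import numpy as np

def design(pg):
    """Second-order polynomial design matrix for pupil-glint vectors (N x 2)."""
    x, y = pg[:, 0], pg[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * y, x ** 2, y ** 2])

def fit_gaze_mapping(pg_cal, screen_cal):
    """Least-squares fit from pupil-glint vectors to screen points (N x 2)."""
    coeffs, *_ = np.linalg.lstsq(design(pg_cal), screen_cal, rcond=None)
    return coeffs

def map_gaze(pg, coeffs):
    """Map pupil-glint vectors to estimated on-screen gaze points."""
    return design(pg) @ coeffs

# Hypothetical calibration: nine targets with a made-up "true" mapping
rng = np.random.default_rng(0)
pg_cal = rng.uniform(-1.0, 1.0, size=(9, 2))
screen_cal = 960.0 + 500.0 * pg_cal + 30.0 * pg_cal ** 2
coeffs = fit_gaze_mapping(pg_cal, screen_cal)
```

A real calibration uses more points and validates the residual error, but the principle, fitting a regression from eye features to screen positions, is the same.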

While the underlying technique is usually the same, the form factor of the eye-tracker can be very different. The main eye-tracking manufacturers propose the system in several forms [2][3][4].

  1. Some eye-trackers are directly integrated into the screen which is used to present the data. This setup has the advantage of a very short calibration, but it can only be used with its own screen.
  2. Separate cameras need some additional calibration time, but the tests can be done on any screen, and even in a real scene by using a scene camera.
  3. Eye-tracking glasses can be used in a very ecological setup, even outside on real-life scenes. An issue with those systems is that it is not easy to aggregate the data from several viewers, as the viewed scene is not the same for each of them. The aggregation needs a non-trivial registration of the scenes.
  4. Cheap devices begin to appear, and quite precise cameras are sold for less than 100 EUR [5], a fraction of the price of a professional eye-tracker. An issue with those eye-trackers is that they come with minimal software, and it is often difficult to synchronize the stimuli with the recorded data. They are mostly used as real-time human-machine interaction devices. Nevertheless, open-source projects such as Ogama [6] allow recording data from low-cost eye-trackers.
  5. Finally, webcam-based software is freely available [7]. It is able to provide good-quality data and can be used remotely with existing webcams [8].
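Whatever the device, the raw gaze samples it outputs are usually segmented into fixations and saccades before any analysis. A classical way to do this is a dispersion-threshold algorithm (often called I-DT): a window of samples is declared a fixation when its spatial spread stays under a threshold for a minimum duration. The following is a minimal sketch; the threshold values and pixel units are illustrative assumptions:

```python
def _dispersion(window):
    """Spatial spread of a set of (x, y) gaze points."""
    xs = [p[0] for p in window]
    ys = [p[1] for p in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def detect_fixations(samples, max_dispersion=30.0, min_samples=5):
    """Dispersion-threshold (I-DT style) fixation detection.

    samples: sequence of (x, y) gaze points at a fixed sampling rate.
    Returns a list of (centroid_x, centroid_y, n_samples) fixations.
    """
    fixations, i = [], 0
    while i + min_samples <= len(samples):
        j = i + min_samples
        if _dispersion(samples[i:j]) <= max_dispersion:
            # Grow the window while the spread stays under the threshold
            while j < len(samples) and _dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            window = samples[i:j]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy, len(window)))
            i = j
        else:
            i += 1  # likely a saccade sample: slide the window forward
    return fixations
```

Two stable gaze clusters in the input would yield two fixations, whose centroids can then be compared with a saliency map.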

Mouse-tracking: the low-cost eye-tracking

While eye tracking is the most reliable ground truth in the study of overt visual attention, it requires a trained operator, imposes some constraints on the user (the head may need to be fixed, the calibration process may be long), and relies on a complex system with a non-negligible cost.

A much simpler way to acquire data about visual attention is mouse tracking. The mouse can be precisely followed within an Internet browser by using a client-side language like JavaScript. The precise mouse position on the screen can be captured either with home-made code or with existing libraries like [9][10]. This technique may appear unreliable; however, everything depends on the context of the experiment.

  1. The first case is when the experiment is hidden from the participant: the participant is unaware that the mouse motion is recorded. In this case the mouse motion is not accurate enough. Indeed, the hand does not automatically follow the eye gaze, even if a tendency of the hand (and consequently the mouse) to follow the gaze is visible. Sometimes the mouse is only used to scroll a page while the eyes are very far from the pointer, for example.
  2. The second case is when the participant is aware of the experiment and has a task to follow. This can go from a simple “point the mouse where you look” instruction, as in [11] with the first use of mouse tracking for saliency evaluation, to more recent approaches such as SALICON [12], where a multi-resolution interactive display mimicking the foveal resolution pushes people to point the mouse cursor where they look.

In this second case, where the participant is aware that the mouse motion is tracked, the results of mouse tracking are very close to those of eye-tracking, as shown by Egner and Scheier on their website [13]. Some unconscious eye movements may still be missed, but is this really an issue?

The main advantages of mouse tracking are its low price and its complete transparency for the users (they just move a mouse pointer).

However, mouse tracking has several drawbacks:

  • The initial location of the mouse pointer is quite important, as the observer may look for the pointer. Should it be located outside the image or in the centre of the image? Ideally, the pointer should initially appear at a random position in the image to avoid introducing a bias due to its initial position.
  • Mouse tracking only highlights areas which are consciously important for the observer. This is more of a theoretical drawback, as in practice one mostly tries to predict the consciously interesting regions anyway.
  • The pointer hides the image region it overlaps, thus the pointer position is never exactly on the important areas but very close to them. This drawback can be partially eliminated by the low-pass filtering step performed after averaging over the whole observer set. It is also possible to make the pointer transparent, as in [12].
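The averaging and low-pass filtering mentioned in the last point can be sketched as follows: each observer’s pointer samples are accumulated into a per-observer map, the maps are averaged over the observer set, and a Gaussian low-pass filter spreads each pointer position onto the nearby region it actually designates. All names and parameter values here are illustrative:

```python
import numpy as np

def attention_map(observers, shape, sigma=10.0):
    """observers: list of lists of (x, y) pointer samples.
    Returns a smoothed attention map of the given (h, w) shape, peak-normalized."""
    h, w = shape
    acc = np.zeros((h, w))
    for samples in observers:
        m = np.zeros((h, w))
        for x, y in samples:
            m[int(y), int(x)] += 1.0       # accumulate pointer positions
        acc += m / max(len(samples), 1)    # per-observer normalization
    acc /= len(observers)                  # mean over the observer set

    # Separable Gaussian low-pass filter
    r = int(3 * sigma)
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    for axis in (0, 1):
        acc = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, acc)
    return acc / acc.max() if acc.max() > 0 else acc
```

The resulting map can then be compared with a saliency model’s output exactly like an eye-tracking fixation map.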

EEG: getting the electrical activity of the brain

The EEG technique (ElectroEncephaloGraphy) uses electrodes located on the participant’s scalp. Those electrodes pick up the electrical waves coming from the brain, which are then amplified. An issue with this technique is that the skull and scalp attenuate those electrical waves.

While classical research setups have a high number of electrodes, with manufacturers like [14][15], some low-cost commercial systems like Emotiv [16] are more compact and easy to install and calibrate. While the latter are easier to use, they are obviously less precise.

EEG studies have provided interesting results, such as the modulation of the gamma band during selective visual attention [17]. Other papers [18] also provide cues about alpha band modulation during attentional shifts.

One very important cue about attention which can be measured using EEG is the mismatch negativity (MMN) event-related potential.

The work of Näätänen et al. [19] in 1978 on auditory attention provided evidence that the evoked potential shows an enhanced negative response when the subject is presented with rare stimuli compared to frequent ones. This negative component is called the mismatch negativity (MMN), and it was observed in several experiments. The MMN occurs 100 to 200 ms after the stimulus, a latency which is perfectly in the range of the pre-attentive phase.

Depending on the experiments, different auditory features were isolated: audio frequency [20], audio intensity [19][21][22], spatial origin [23], duration [24] and phonetic changes [25]. None of these features was salient on its own: saliency was induced by the rarity of each of them.

The study of the MMN signal for visual attention was carried out several times in conjunction with audio attention [26][27][28], but few experiments used visual stimuli alone. In her thesis [29], Crottaz-Herbette ran an experiment in the same conditions as for the auditory MMN in order to find out whether a visual MMN really exists. The result was clearly positive, with a strong increase in the negativity of the evoked potential when seeing rare stimuli compared with the evoked potential when seeing frequent stimuli. The visual MMN occurs from 120 to 200 ms after the stimulus. The 200 ms frontier strangely matches the 200 ms needed to initiate a first eye movement, thus to initiate the “attentive” serial attentional mechanism. As for the auditory MMN, no specific task was asked of the subjects, who only had to watch the stimuli; this MMN component is thus pre-attentive, unconscious and automatic.
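In practice, the MMN is extracted by averaging the EEG epochs time-locked to frequent (“standard”) and rare (“deviant”) stimuli separately, then subtracting the two averages; the resulting difference wave exhibits the extra negativity in roughly the 100–200 ms window. The sketch below runs this computation on synthetic epochs (the amplitudes, latencies and sampling rate are invented for illustration):

```python
import numpy as np

fs = 500                          # sampling rate in Hz (illustrative)
t = np.arange(0, 0.4, 1 / fs)     # one 0-400 ms epoch

rng = np.random.default_rng(1)

def noise(n_trials):
    """Background EEG noise for n_trials epochs."""
    return rng.normal(0.0, 2.0, size=(n_trials, t.size))

# Common evoked component plus a deviant-only negativity around 150 ms
common = 1.5 * np.exp(-((t - 0.10) ** 2) / (2 * 0.02 ** 2))
mmn = -3.0 * np.exp(-((t - 0.15) ** 2) / (2 * 0.025 ** 2))

standards = common + noise(200)        # epochs for frequent stimuli
deviants = common + mmn + noise(40)    # epochs for rare stimuli

def erp(epochs):
    """Average event-related potential over trials (n_trials x n_samples)."""
    return epochs.mean(axis=0)

difference = erp(deviants) - erp(standards)   # deviant-minus-standard wave
peak_ms = 1000 * t[np.argmin(difference)]     # latency of the maximal negativity
```

On real recordings the same subtraction is preceded by filtering and artifact rejection, but the deviant-minus-standard logic is identical.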

This study and others [30] also suggest the presence of an MMN response for the somatosensory modality (touch).

The MMN thus seems to be a universal component illustrating the brain’s reaction during an unconscious, pre-attentive process. Any unknown stimulus (novel, rare) will be very salient, as the brain will try to know more about it. Rarity is the major engine of the attentional mechanism for visual, auditory and all the other signals acquired from our environment.

Functional imaging: fMRI

MRI stands for Magnetic Resonance Imaging. The main idea behind this kind of imaging is that the human body is mainly made of water, which is itself composed of hydrogen atoms whose nucleus is a single proton. Those protons have a magnetic moment (spin) which is randomly oriented most of the time. The MRI device sets up a very strong magnetic field which aligns the magnetic moments of the protons in the patient’s body. Radio-frequency (RF) pulses orthogonal to the initial magnetic field push the protons to align with this new field; the protons then realign with the initial magnetic field while releasing RF waves. Those waves are captured and used to construct an image in which bright gray levels indicate more protons, and therefore more water (as in fat, for example), while darker gray levels reveal regions with less water, such as bone.

MRI is initially an anatomical imaging technique, but there is a functional version, fMRI, based on the BOLD (Blood-Oxygen-Level Dependent) contrast. No substance needs to be injected here: the contrast comes from the blood itself. Deoxygenated hemoglobin is paramagnetic and locally disturbs the magnetic field, while oxygenated hemoglobin is not. When a region of the brain is activated, the blood flow and oxygenation locally increase, which changes the magnetic response measured by the scanner. fMRI is thus capable of detecting the areas of the brain which are active, and it has become a great tool for neuroscientists, who can visualize which area of the brain is activated while the patient performs an attention-related exercise.
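Conceptually, locating the active areas amounts to finding the voxels whose BOLD time series follows the task timing. The toy example below correlates each voxel of a synthetic “brain” with a block-design on/off regressor; real analyses model the hemodynamic response and apply proper statistics, so every value and threshold here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n_scans = 120
# Block design: alternating 10 rest / 10 task scans (illustrative timing)
task = np.tile([0.0] * 10 + [1.0] * 10, n_scans // 20)

# Synthetic "brain": an 8x8 grid of voxels; a 2x2 patch responds to the task
signals = rng.normal(0.0, 1.0, size=(8, 8, n_scans))
signals[3:5, 3:5, :] += 2.0 * task   # hypothetical activated region

def activation_map(signals, regressor):
    """Pearson correlation of each voxel time series with the task regressor."""
    s = signals - signals.mean(axis=-1, keepdims=True)
    r = regressor - regressor.mean()
    num = (s * r).sum(axis=-1)
    den = np.sqrt((s ** 2).sum(axis=-1) * (r ** 2).sum())
    return num / den

amap = activation_map(signals, task)
active = amap > 0.5   # crude threshold instead of a proper statistical test
```

The thresholded map recovers the responding patch; in a real pipeline the regressor would be convolved with a hemodynamic response function and the threshold set by a statistical test.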

Functional imaging: MEG

MEG stands for MagnetoEncephaloGraphy. The idea is simple: while the EEG detects the electrical field, which is heavily distorted when traversing the skull and skin, MEG detects the magnetic field induced by this electrical activity. The magnetic field has the advantage of not being distorted by the skin or the skull. While the idea is simple, in practice the magnetic field is very weak, which makes it very difficult to measure. This is why MEG imaging is relatively new: technological advances made MEG effective thanks to SQUIDs (Superconducting QUantum Interference Devices). The magnetic field of the brain induces a current in a superconducting device, which can be precisely measured. Modern devices have spatial resolutions of about 2 millimetres and temporal resolutions of a few milliseconds. Moreover, MEG images can be superimposed on anatomical MRI images, which helps rapidly localise the main active areas. Finally, participants can sit during MEG imaging, a position which is more natural during exercises than the horizontal position required by fMRI or PET.

Functional imaging: PET Scan

As with fMRI, the PET scan (Positron Emission Tomography) is a functional imaging technique and can thus also produce a higher signal in case of brain activity. The main idea of the PET scan is that the substance injected into the patient releases positrons (anti-electrons, particles with the same properties as an electron but with a positive charge). Those positrons almost instantaneously meet an electron in a highly exo-energetic reaction called annihilation. This annihilation transforms the whole mass of the two particles into energy and releases two gamma photons in opposite directions, which are detected by the scanner sensors. The injected substance accumulates in the areas of the brain which are the most active, which means that those areas exhibit a high number of annihilations. As with fMRI, the PET scan lets neuroscientists know which areas of the brain are activated when the patient is performing an attention task.

Functional imaging and attention

Positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) have been extensively used to explore the functional neuroanatomy of cognitive functions, and MEG imaging is beginning to be used in the field, as in [31]. In [32], a review of 275 PET and fMRI studies of attention, perception, memory, language, etc., is presented. Depending on the setup and the task, a large variety of brain regions seem to be involved in attention and related functions (language, memory). These findings again support the idea that at the brain level there are several attentions, and that their activity is largely distributed across almost the whole brain. Attention goes from low-level to high-level processing, from reflexes to memory and emotions, and across all the human senses.


[1] Duchowski, Andrew. Eye tracking methodology: Theory and practice. Vol. 373. Springer Science & Business Media, 2007.

[2] Tobii eye tracking technology,

[3] SMI eye tracking technology,

[4] SR-Research eye tracking technology,

[5] Eyetribe low cost eye-trackers,

[6] Open source recording from several eye trackers,

[7] Open source eye-tracking for webcams,

[8] Xu, Pingmei, et al. “TurkerGaze: Crowdsourcing Saliency with Webcam based Eye Tracking.” arXiv preprint arXiv:1504.06755 (2015).

[9] Heatmapjs, javascript API,

[10] Simple Mouse Tracker,

[11] Mancas, Matei. “Relative influence of bottom-up and top-down attention.” Attention in cognitive systems. Springer Berlin Heidelberg, 2009. 212-226.

[12] Jiang, Ming, et al. “SALICON: Saliency in Context.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[13] MediaAnalyzer website:

[14] Cadwell EEG,

[15] Natus EEG,

[16] Emotiv EEG,

[17] Müller, Matthias M., Thomas Gruber, and Andreas Keil. “Modulation of induced gamma band activity in the human EEG by attention and visual information processing.” International Journal of Psychophysiology 38.3 (2000): 283-299.

[18] Sauseng, Paul, et al. “A shift of visual spatial attention is selectively associated with human EEG alpha activity.” European Journal of Neuroscience 22.11 (2005): 2917-2926.

[19] Näätänen, R., Gaillard, A.W.K., and Mäntysalo, S., “Early selective-attention effect on evoked potential reinterpreted”, Acta Psychologica, 42, 313-329, 1978

[20] Sams, H., Paavilainen, P., Alho, K., and Näätänen, R., “Auditory frequency discrimination and event-related potentials”, Electroencephalography and Clinical Neurophysiology, 62, 437-448, 1985

[21] Näätänen, R., and Picton, T., “The N1 wave of the human electric and magnetic response to sound: a review and analysis of the component structure”, Psychophysiology, 24, 375-425, 1987

[22] Paavilainen, P., Alho, K., Reinikainen, K., Sams, M., and Näätänen, R., “Right hemisphere dominance of different mismatch negativities”, Electroencephalography and Clinical Neurophysiology, 78, 466-479, 1991

[23] Paavilainen, P., Karlsson, M.L., Reinikainen, K., and Näätänen, R., “Mismatch Negativity to change in spatial location of an auditory stimulus”, Electroencephalography and Clinical Neurophysiology, 73, 129-141, 1989

[24] Paavilainen, P., Jiang, D., Lavikainen, J., and Näätänen, R., “Stimulus duration and the sensory memory trace: An event-related potential study”, Biological Psychology, 35 (2), 139-152, 1993

[25] Aaltonen, O., Niemi, P., Nyrke, T., and Tuhkahnen, J.M., “Event-related brain potentials and the perception of a phonetic continuum”, Biological psychology, 24, 197-207, 1987

[26] Neville, H.J., and Lawson, D., “Attention to central and peripheral visual space in a movement detection task: an event-related potential and behavioral study. I. Normal hearing adults”, Brain Research, 405, 253-267, 1987

[27] Czigler, I., and Csibra, G., “Event-related potentials in a visual discrimination task: Negative waves related to detection and attention”, Psychophysiology, 27 (6), 669-676, 1990

[28] Alho, K., Woods, D.L., Alagazi, A., and Näätänen, R., “Intermodal selective attention. II. Effects of attentional load on processing of auditory and visual stimuli in central space”, Electroencephalography and Clinical Neurophysiology, 82, 356-368, 1992

[29] Crottaz-Herbette, S., “Attention spatiale auditive et visuelle chez des patients héminégligents et des sujets normaux : étude clinique, comportementale et électrophysiologique“, PhD Thesis, University of Geneva, Switzerland, 2001

[30] Desmedt, J.E., and Tomberg, C., “Mapping early somatosensory evoked potentials in selective attention: Critical evaluation of control conditions used for titrating by difference the cognitive P30, P40, P100 and N140”, Electroencephalography and Clinical Neurophysiology, 74, 321-346, 1989

[31] Downing, Paul, Jia Liu, and Nancy Kanwisher. “Testing cognitive models of visual attention with fMRI and MEG.” Neuropsychologia 39.12 (2001): 1329-1342.

[32] Cabeza, Roberto, and Lars Nyberg. “Imaging cognition II: An empirical review of 275 PET and fMRI studies.” Journal of cognitive neuroscience 12.1 (2000): 1-47.

Computational Attention Insights

What is attention? – Part 1: From philosophy to psychology

A short history of attention:

Human attention is an obvious phenomenon which is active during every single moment of awareness. It was studied first in philosophy, followed by experimental psychology, cognitive psychology, cognitive neuroscience and finally computer science for modeling. Those studies are not a serial sequence: each adds to the previous ones, like the layers of an “attention onion” (Figure 1).

Fig. 1 Attention history: an accumulation of domains in onion layers.

Due to the high diversity of applications of attention, a precise and general definition is not easy to find. Moreover, the views on attention have evolved over time and across research domains. In this first part of an attempt to define attention, we go through a brief history of the related research, from philosophy to cognitive psychology. This first part covers the period when the study of attention was more or less contained within a single community.

Conceptual findings: attention in philosophy

A first important study of human attention was that of N. Malebranche, a French Oratorian priest who was also a philosopher. In his “The Search after Truth”, published in 1675, Malebranche focused on the role of attention as a structuring system in scene understanding and thought organization.

In the 18th century, G. W. Leibniz introduced the concept of “apperception”, which refers to the assimilation of new and past experience into a new view of the world [1]. Leibniz’s intuition concerns an involuntary form of attention (known today as “bottom-up”) which is needed for a perceived event to become conscious.

In the 19th century, Sir W. Hamilton, a Scottish metaphysician, changed the previous view on attention, which held that humans can only focus on a single stimulus at a time. Hamilton noted that when people throw marbles, the placement of only about seven of them can be remembered [2]. This finding opened the way to the notion of “divided attention” and led, about one century later, to the famous 1956 paper of G. A. Miller, “The Magical Number Seven, Plus or Minus Two” [3].

Attention in experimental psychology

After the first philosophical approaches, attention entered a scientific phase when approached by psychology. Based on an observation error detected in astronomy, W. Wundt introduced the study of consciousness and attention to the field of psychology [4]. He interpreted this observation error as the time needed to voluntarily switch one’s attention from one stimulus to another, and initiated a series of studies on mental processing speed, such as those carried out by F. Donders [5].

In the second half of the 19th century, H. von Helmholtz, in his “Treatise on Physiological Optics” [6], noted that despite the illusion that we see our whole environment at the same resolution, humans need to move their eyes around the whole visual field “because that is the only way we can see as distinctly as possible all the individual parts of the field in turn.” Even if he personally mainly inspected the eye-movement scanpath (overt attention), he also discussed the existence of a covert attention (which does not induce eye movements). Von Helmholtz focused on the role of attention in answering the question of “where” the objects of interest are.

In 1890, W. James published his textbook “The Principles of Psychology” [7] and remarked that attention is closely related to consciousness and structure. According to James, attention makes people perceive, conceive, distinguish and remember, and it shortens reaction time. James thus linked attention to the notions of data compression and memory. Contrary to von Helmholtz, James focused on the idea that attention should answer the question of “what” the objects of interest are.

Attention in cognitive psychology

From the very beginning of the 20th century until 1949, the mainstream approach in psychology was behaviorism. During this period, the study of the mind was considered barely scientific, and no important advances were achieved in the field of attention. Despite this “hole” in the study of attention, we can still find names such as J. R. Stroop, who worked on the “Stroop effect” [8], showing that conflicting stimuli (reading vs. color naming) heavily impair people’s performance.

After the Second World War, with its practical questions about soldiers’ attention, and with the development of cognitivism, the study of attention made a tremendous comeback. Against the behaviorist view, which states that the organism’s behavior is under environmental control, cognitivism showed that behavior can be modulated by attention.

The comeback of attention began with the work of C. Cherry in 1953 on the famous “cocktail party” paradigm [9]. This approach models how people select the conversation they are listening to and ignore the rest. This problem was called “focused attention”, as opposed to “divided attention”.

Broadbent [10] summarized most of the findings known until then in a “bottleneck” model in which he described the selection properties of attention. The idea is that attention acts like a filter (selector) of relevant information based on basic features, such as color or orientation. If the incoming information matches the filter, it can reach awareness (a conscious state); otherwise it is discarded. At that time, the study of attention seemed to become very coherent and was called “early selection”. Nevertheless, after this short positive period, most of the findings summarized by Broadbent proved to be conflicting.

The first “attack” came from the alternative model of Deutsch and Deutsch [11], who used some properties of the cocktail-party paradigm to introduce a “late selection” model, where attentional selection is basically a matter of memory processing and response selection. The idea is that all information is acquired, but only the information which fits semantic or memory-related objects is selected to reach awareness. This is the opposite view to Broadbent’s, which professes an early selection of the features before they reach any further processing.

New models were introduced, like the attenuated-filter model of A. Treisman [12], a softer version of Broadbent’s bottleneck which lets stimuli with a response higher than a given threshold switch the filter, and thus the focus of selective attention.

Later, in 1980, Treisman and Gelade [13] proposed a new “feature integration” theory, where attention occurs in two distinct steps: a preattentive, parallel, effortless step which analyzes objects and extracts features from them; and a second step in which those features are combined to obtain a hierarchy of attentional focus which pushes information towards awareness.

Despite its high importance within psychology, the feature integration theory was also highly disputed. Other theories emerged, such as M. Posner’s spotlight [14], supporting a spatial selection approach, or D. Kahneman’s capacity theory [15], supporting the idea of mental effort.

In the late 1980s, a number of theories on attention flourished, and none of them was capable of including all previous findings. According to H. Pashler [16], cognitive psychology had reached a dead end. After several decades of research in cognitive psychology, more questions had been raised than answered. Pashler declared that “no one knows what attention is” as a provocative response to the famous “everyone knows what attention is” proposed by James one century before.

The need for new approaches: after the late 1980s “crisis”

Attention deals with the allocation of cognitive resources to important incoming information in order to bring it to a conscious state, update the scene model and memory, and influence behavior. Between consciousness, memory and behavior, attention turned out to be much more complex than initially expected, and some even question whether attention is a single concept or whether there are several “attentions”. Sometimes, attention became a kind of magic box into which everything that could not be explained otherwise was put.

The number of issues and the complexity of the nature of attention led to an interesting move: the split of the study of attention from one single community into two different communities.

One community aims to get further into the theoretical and profound nature of attention (cognitive neuroscience), using adapted simple stimuli. The arrival of advanced tools such as functional imaging or single-cell recordings allowed it to make huge steps towards the understanding of attention.

The second community working in the attention field aims to make the concept work with real data such as images, videos or others (computer science). From the late 1990s and the first computational models of visual attention, those two approaches developed in parallel, one trying to get more insight into the biological brain and the other trying to get results which can predict eye behavior on real-life stimuli. Even if the computational attention community produced some models very different from what is known to happen in the brain, the engineers’ creativity is impressive, the results on real-life data begin to be significant, and the applications are endless.

In the second part of our attempt to know more about what attention is, we will focus on cognitive neuroscience on one side and on computational attention on the other, along with the known properties of attention.


[1] Runes, Dagobert D., ed. The dictionary of philosophy. Citadel Press, 2001.

[2] Hamilton, William. Lectures on metaphysics and logic. Vol. 1. Gould and Lincoln, 1859.

[3] Miller, George A. “The magical number seven, plus or minus two: some limits on our capacity for processing information.” Psychological review 63.2 (1956): 81.

[4] Wundt, Wilhelm Max. Principles of physiological psychology. Vol. 1. Sonnenschein, 1904.

[5] Goldstein, E. Cognitive psychology: Connecting mind, research and everyday experience. Cengage Learning, 2014.

[6] von Helmholtz, Hermann. Treatise on physiological optics. Vol. 3. Courier Corporation, 2005.

[7] James, William. “The principles of psychology, Vol II.” (1913).

[8] Jensen, Arthur R., and William D. Rohwer. “The Stroop color-word test: A review.” Acta psychologica 25 (1966): 36-93.

[9] Cherry, E. Colin. “Some experiments on the recognition of speech, with one and with two ears.” The Journal of the acoustical society of America 25.5 (1953): 975-979.

[10] Broadbent, Donald Eric. “A mechanical model for human attention and immediate memory.” Psychological review 64.3 (1957): 205.

[11] Deutsch, J. Anthony, and Diana Deutsch. “Attention: some theoretical considerations.” Psychological review 70.1 (1963): 80.

[12] Treisman, Anne M. “Selective attention in man.” British medical bulletin (1964).

[13] Treisman, Anne M., and Garry Gelade. “A feature-integration theory of attention.” Cognitive psychology 12.1 (1980): 97-136.

[14] Posner, Michael I. “Attention in cognitive neuroscience: an overview.” (1995).

[15] Friedenberg, Jay, and Gordon Silverman. Cognitive science: an introduction to the study of mind. Sage, 2011.

[16] Pashler, Harold E., and Stuart Sutherland. The psychology of attention. Vol. 15. Cambridge, MA: MIT press, 1998.

Computational Attention Insights

Why should computers be attentive?

Any animal [1], from the tiniest insect [2] to humans, is perfectly able to “pay attention”. Attention is the first step of perception: it analyses the outer real world and turns it into an inner conscious representation. Even during the dreaming phases known as REM (Rapid Eye Movement) sleep, the eye activity proves that the attentional mechanism is at work, but this time it analyses a virtual world coming from the inner subconscious and turns it into an inner conscious representation. Attention seems to be not only the first step of perception, but also the gate to conscious awareness.

The attentional process probably activates with the first developments of a complex sense (like audition), which come with the first REM dreams beginning after the sixth month of foetal development [3]. This mechanism is one of the first cognitive processes to be set up, and factors like smoke, drugs, alcohol or even stress during pregnancy lead to later attention disorders and even higher chances of developing psychopathologies [4][5]. It is well established that in cognitive psychopathologies (like autism or schizophrenia) the attentive process is highly affected, as shown mainly by eye-tracking traces which can be very different between patients and control groups [6][7]. The attentive process is set up as early as the prenatal period, when it already begins to operate during babies’ dreams. Until death, it occurs in every single moment of the day when people are awake, but also during their dreams. This shows the importance of attention: it cannot be dissociated from perception and consciousness. Even when a person is sleeping without dreaming and the eyes are not moving, important stimuli can “wake up” that person. Attention is never turned off; it can only be lowered and put on standby (except in drug-induced states where consciousness is altered or eliminated, as in artificial coma). It is thus safe to say that if there is conscious life in a body capable of acting on its environment, there is attention.

As the gate of conscious awareness at the interface between the inner and the outer worlds, attention can be both conscious (attentive) and unconscious (pre-attentive), and it is the key to survival. Attention is also a sign of limited computation capabilities. Vision, audition, touch, smell and taste all provide the brain with a huge amount of information. Gigabits of raw sensory data flow every second into the brain, which cannot physically handle such an information rate. Attention provides the brain with the capacity to select the main information and build priority tasks. While there are a lot of definitions and views of attention, the one core idea which justifies attention regardless of the discipline, methodology or intuition is “information reduction” [8].

Attention only began to be seriously studied in the 19th century with the arrival of modern psychology. Some thoughts about attention-related concepts may be found in Descartes, but no rigorous and intensive scientific study was done until the beginning of psychology. How did philosophers miss such a key concept as attention from antiquity until almost now? Part of the answer is given by William James, one of the fathers of psychology, in his famous definition of attention: “Everybody knows what attention is.” Attention is so natural, so linked to life and partly unconscious, so obvious that … nobody really noticed it until recently.

However, little by little, a new transversal research field appeared around the concept of “attention”, gathering first psychologists, then neuroscientists and, since the end of the nineties, even engineers and computer scientists. While covering all the research on attention would require a whole series of books, the topic is narrowed here to attention modelling, a crucial step towards a wider artificial intelligence.

Indeed, this key process of attention is currently rarely used within computers. Like the brain, a computer is a processing unit. Like the brain, it has limited computation capabilities and memory. Like the brain, computers must analyse more and more data. But unlike the brain, they do not pay attention. While a classical computer will be more precise in quantifying the whole input data, an attentive computer will focus on the most “interesting” data, which has several advantages:

  • It will be faster and more efficient in terms of memory storage due to its ability to process only part of the input data.
  • It will be able to find regularities and irregularities in the input signal and thus be able to detect and react to unexpected or abnormal events.
  • It will be able to optimize data prediction by describing novel patterns and, depending on how efficient the information reduction was, it will be capable of being curious, bored or annoyed. This curiosity, which constantly pushes towards the discovery of ever more complex patterns to better reduce information, is a first step towards creativity.
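The idea of processing only the most “interesting” part of the input can be illustrated with a toy sketch (not from the article; the function names and the saliency measure are hypothetical): saliency is approximated as each sample’s deviation from its local mean, and only the few most salient samples of a signal are kept, discarding the redundant rest.

```python
# Toy illustration of "information reduction": keep only the most salient
# samples of a 1-D signal. Saliency is a crude centre-surround proxy:
# how much each sample deviates from the mean of its neighbourhood.

def saliency(signal, radius=2):
    """Absolute deviation of each sample from its local neighbourhood mean."""
    scores = []
    for i, x in enumerate(signal):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        neighbourhood = signal[lo:hi]
        scores.append(abs(x - sum(neighbourhood) / len(neighbourhood)))
    return scores

def attend(signal, keep=2):
    """Return the `keep` most salient (index, value) pairs, in signal order."""
    scores = saliency(signal)
    ranked = sorted(range(len(signal)), key=lambda i: scores[i], reverse=True)
    return sorted((i, signal[i]) for i in ranked[:keep])

# A flat signal with two abnormal spikes: attention picks out the spikes
# and ignores the redundant flat samples.
print(attend([1, 1, 1, 9, 1, 1, 1, 1, 7, 1, 1]))  # → [(3, 9), (8, 7)]
```

An attentive system built on this principle stores and reacts to two samples instead of eleven, which is the efficiency and abnormality-detection advantage listed above in miniature.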

As attention is the gate to awareness and consciousness in humans, in computers attention can lead to novel emergent computational paradigms beyond classical pre-programmed machines. While the road towards self-modifying computers is still long, computational attention is developing at an exponential pace, letting more and more applications benefit from it.


[1] Zentall, Thomas R. “Selective and divided attention in animals.” Behavioural Processes 69.1 (2005): 1-15.
[2] Hoy, Ronald R. “Startle, categorical response, and attention in acoustic behavior of insects.” Annual review of neuroscience 12.1 (1989): 355-375.
[3] Hopson, Janet L. “Fetal psychology.” Psychology Today 31.5 (1998): 44.
[4] Mick, Eric, et al. “Case-control study of attention-deficit hyperactivity disorder and maternal smoking, alcohol use, and drug use during pregnancy.” Journal of the American Academy of Child & Adolescent Psychiatry 41.4 (2002): 378-385.
[5] Linnet, Karen Markussen, et al. “Maternal lifestyle factors in pregnancy risk of attention deficit hyperactivity disorder and associated behaviors: review of the current evidence.” American Journal of Psychiatry 160.6 (2003): 1028-1040.
[6] Holzman, Philip S., et al. “Eye-tracking dysfunctions in schizophrenic patients and their relatives.” Archives of general psychiatry 31.2 (1974): 143-151.
[7] Klin, Ami, et al. “Visual fixation patterns during viewing of naturalistic social situations as predictors of social competence in individuals with autism.” Archives of general psychiatry 59.9 (2002): 809-816.
[8] Itti, Laurent, Geraint Rees, and John K. Tsotsos, eds. Neurobiology of attention. Academic Press, 2005.