Computational Attention Insights

Attention in computer science – Part 1

Matei Mancas, Numediart Institute, Faculty of Engineering (FPMs), University of Mons (UMONS), 31 Bd. Dolez, 7000 Mons, Belgium

Idea and approaches. As we already saw, attention was first addressed by philosophy, then discussed by cognitive psychology and neuroscience; only in the late nineties did attention modeling arrive in computer science and engineering. In this domain, two main approaches can be found. The first is based on the notion of “saliency”, the second on the idea of “visibility”. In practice, saliency-based models are far more widespread in computer science than visibility models. The notion of “saliency” implies a competition between “bottom-up” (exogenous) and “top-down” (endogenous) information. The idea behind bottom-up saliency maps is that people’s gaze is directed to areas which, in some way, stand out from the background because of novel or rare features. This bottom-up saliency can be modulated by top-down information based on memory, emotions or goals. The eye movements (scan paths) can be computed from the saliency map, which remains the same during eye motion: it is a global, static attention (saliency) map which only provides, for each pixel, a probability of attracting human gaze.

Visibility models. These models of human attention assume that people attend to the locations that maximize the information acquired by the eye (the visibility) in order to solve a given task (which can also simply be free viewing). In this case, top-down information is naturally included in the notion of task, along with the dynamic maximization of bottom-up information. In this approach the eye movements are a direct output of the model and do not have to be inferred from a “saliency map”, which is here considered as a surface giving the posterior probability (following each fixation) that the target is at each scene location (Geisler & Cormack, 2011). Compared to other Bayesian frameworks, like the one of Oliva et al. (2003), visibility models have one main difference: the saliency map is dynamic. Indeed, visibility models make explicit the resolution variability of the retina (Figure 1), so that an attention map is “re-computed” at each new fixation, as the feature visibility changes at each of these fixations. Tatler (2007) introduces a tendency of the eye gaze to stay in the middle of the scene to maximize the visibility over the image (which recalls the central preference observed for natural images, also called the centered Gaussian bias).


Figure 1: Depending on the eye fixation position, the visibility, and thus the feature extraction, is different. Adapted from images by Jeff Perry.

Visibility models are mostly used for strong tasks (for example, Legge et al. (2002) proposed a visibility model capable of predicting eye fixations during reading), and few of them are applied to free viewing, which is considered a weak task (Geisler & Cormack, 2011).
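The “re-computation” of the attention map at each fixation can be illustrated with a minimal sketch. It is not taken from any of the cited models: the Gaussian fall-off of visibility with distance from the fixation point, the `sigma` parameter and the function names are assumptions chosen for illustration, whereas real visibility models rely on measured retinal resolution curves.

```python
import math

def visibility_map(width, height, fixation, sigma=3.0):
    """Return a height x width grid of visibility weights in (0, 1].

    Assumption: visibility decays as a Gaussian of the distance to the
    fixation point (a stand-in for the true retinal resolution curve).
    """
    fx, fy = fixation
    return [
        [math.exp(-((x - fx) ** 2 + (y - fy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def attention_map(features, fixation, sigma=3.0):
    """Weight a feature-contrast grid by the visibility around `fixation`."""
    height, width = len(features), len(features[0])
    vis = visibility_map(width, height, fixation, sigma)
    return [[features[y][x] * vis[y][x] for x in range(width)]
            for y in range(height)]

# Two equally contrasted items in a hypothetical 9 x 9 feature grid.
features = [[0.0] * 9 for _ in range(9)]
features[1][1] = 1.0   # item near the top-left corner
features[7][7] = 1.0   # item near the bottom-right corner

# The same features give a different map for each fixation position.
map_a = attention_map(features, fixation=(1, 1))
map_b = attention_map(features, fixation=(7, 7))
```

With the fixation on the first item, that item dominates `map_a`; once the eye moves to the second item, the map has to be computed again and the ranking is reversed in `map_b`.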

Saliency approaches: bottom-up methods. While visibility models are mostly used in cognitive sciences and with strong tasks, the bottom-up approaches used in computer science rely on features that are extracted only once from the signal, independently of the eye fixations, and mainly target free viewing. Features such as luminance, color, orientation, texture, relative position of objects, or even simply neighborhoods or patches are extracted from the image. Once those features are extracted, all the existing methods are essentially based on the same principle: looking for areas that are contrasted, rare, surprising, novel, worth learning, less compressible or information-maximizing. All those definitions are actually synonyms: they all amount to searching for unusual features in a given spatial context. In the following, we provide examples of contexts used for still images to obtain a saliency map. This saliency map can be visualized as a heatmap where hot colors represent pixels with a higher probability of attracting human gaze (Figure 2).


Figure 2: Left: initial image. Right: superimposed saliency heatmap on the initial image. The saliency map is static and gives an overview of where the eye is likely to attend.

Saliency methods for still images. The literature on saliency models for still images is very active. Those models have various implementations and technical approaches, even if they all initially derive from the same idea. The purpose here is not to review all those models; instead, we propose a taxonomy to classify them. We structure this taxonomy on the context that the methods take into account to exhibit image novelty. In this framework, there are three classes of methods.

The first one focuses on a pixel’s surroundings: here a pixel, a group of pixels or a patch is compared with its surroundings at one or several scales. The main idea is to compute visual features at several scales in parallel, to apply center-surround inhibition, to combine the results into conspicuity maps (one per feature) and finally to fuse them into a single saliency map. Many models derive from this approach; they mainly use local center-surround contrast as a local measure of novelty. A good example of this family is the model of Itti et al. (1998), which is the first implementation of the Koch and Ullman model. This implementation proved to be the first successful approach to attention computation, providing better predictions of human gaze than chance or simple descriptors like entropy.
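As a rough illustration of local center-surround contrast, the sketch below compares the mean intensity of a small “center” window with that of a larger “surround” window and averages the result over a few scales. It is a simplified stand-in and not the Itti et al. (1998) implementation: the box windows, the three `(center, surround)` radius pairs and the single intensity feature are assumptions.

```python
def local_mean(img, x, y, radius):
    """Mean of img over a square window clipped at the image borders."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for j in range(max(0, y - radius), min(h, y + radius + 1)):
        for i in range(max(0, x - radius), min(w, x + radius + 1)):
            total += img[j][i]
            count += 1
    return total / count

def center_surround(img, center_radius, surround_radius):
    """Absolute difference between the center mean and the surround mean."""
    h, w = len(img), len(img[0])
    return [[abs(local_mean(img, x, y, center_radius)
                 - local_mean(img, x, y, surround_radius))
             for x in range(w)] for y in range(h)]

def saliency(img, scales=((0, 1), (0, 2), (1, 3))):
    """Average the center-surround contrast over several (assumed) scales."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for center_radius, surround_radius in scales:
        contrast = center_surround(img, center_radius, surround_radius)
        for y in range(h):
            for x in range(w):
                out[y][x] += contrast[y][x] / len(scales)
    return out

# A uniform 7 x 7 intensity image with one bright pixel in the middle.
image = [[0.1] * 7 for _ in range(7)]
image[3][3] = 1.0
sal = saliency(image)
```

In this toy image the bright pixel receives the highest value of `sal`, because it differs most from its neighbourhood at every scale.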

A second class of methods uses the entire image as a context and compares pixels or patches with other pixels or patches from other locations in the image, not necessarily in the surroundings of the initial patch. The idea can be divided into two steps. First, local features are computed in parallel from a given image. The second step measures the likeness of a pixel or a neighborhood of pixels to other pixels or neighborhoods within the image. A good example can be found in Seo & Milanfar (2009), who first propose to use local regression kernels as features and then apply a nonparametric kernel density estimation to those features, which results in a saliency map based on a local “self-resemblance” measure. Mancas (2009) and Riche et al. (2013) focus on the entire image: these models are designed to detect saliency in areas which are globally rare and locally contrasted. Boiman & Irani (2007) look for similar patches and relative positions of these patches in an image.
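The idea of global rarity can be sketched by measuring how often each pixel value occurs in the whole image and scoring rare values with their self-information. This is only an illustration of the principle, not the method of Mancas (2009) or Riche et al. (2013): the single intensity feature, the quantization into `levels` bins and the function name are assumptions.

```python
import math
from collections import Counter

def rarity_map(img, levels=4):
    """Self-information -log2(p) of each pixel's quantized value,
    where p is the frequency of that value over the whole image."""
    total = len(img) * len(img[0])
    # Quantize intensities in [0, 1] into `levels` bins.
    bins = [[min(int(v * levels), levels - 1) for v in row] for row in img]
    counts = Counter(b for row in bins for b in row)
    return [[-math.log2(counts[b] / total) for b in row] for row in bins]

# A 4 x 4 image where a single pixel has an unusual intensity.
image = [[0.2] * 4 for _ in range(4)]
image[2][1] = 0.9
rarity = rarity_map(image)
```

The unusual pixel occurs once among 16 pixels, so its score is log2(16) = 4 bits, while the 15 frequent pixels score close to zero.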

Finally, the third class of methods takes into account a context based on a model of what normality should be: if things are not as they should be, this can be surprising and thus attract people’s attention. Achanta et al. (2009) proposed a very simple attention model: a distance is computed between a smoothed version of the input image and the average color vector of the input image. The average color vector is used as a kind of model of the image statistics: pixels which are far from those statistics are more salient. This model is mainly useful for salient object detection. Another approach to “normality” can be found in Hou & Zhang (2007), where the authors proposed a spectral model that is independent of any features. The difference between the log-spectrum of the image and its smoothed log-spectrum (the spectral residual) is reconstructed into a saliency map. Indeed, a smoothed version of the log-spectrum is closer to a 1/f decreasing log-spectrum, used as a template of normality, as small variations are removed. This approach is almost as simple as that of Achanta et al. (2009) but more effective at predicting eye fixations.
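The distance between a smoothed image and the average color vector can be sketched in a few lines. This is a simplified illustration of the principle described for Achanta et al. (2009), not their implementation: working directly on RGB triplets and smoothing with a 3 x 3 box blur are assumptions made to keep the example short, and the published method may use a different color space and filter.

```python
import math

def box_blur(img):
    """3 x 3 box blur of an image stored as rows of (r, g, b) tuples."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(tuple(sum(p[c] for p in window) / len(window)
                             for c in range(3)))
        out.append(row)
    return out

def mean_color(img):
    """Average color vector of the whole image."""
    pixels = [p for row in img for p in row]
    return tuple(sum(p[c] for p in pixels) / len(pixels) for c in range(3))

def mean_distance_saliency(img):
    """Euclidean distance of each smoothed pixel to the average color."""
    mu = mean_color(img)
    return [[math.dist(p, mu) for p in row] for row in box_blur(img)]

# A gray 5 x 5 image with one red pixel in the middle.
gray, red = (0.5, 0.5, 0.5), (1.0, 0.0, 0.0)
image = [[gray] * 5 for _ in range(5)]
image[2][2] = red
sal = mean_distance_saliency(image)
```

The region around the red pixel ends up farther from the average color than the gray corners, so it receives a higher saliency value.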

Towards video, audio or 3D signals and top-down attention. In the next parts we will focus on other kinds of signals such as moving images (video), audio or even 3D signals. In addition, even if top-down information is less often modeled in saliency approaches, there is nevertheless an important literature on the topic, which will also be detailed in the next parts.

Achanta, R., Hemami, S., Estrada, F. & Susstrunk, S. (2009). Frequency-tuned salient region detection, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).
Boiman, O. & Irani, M. (2007). Detecting irregularities in images and in video, International Journal of Computer Vision 74(1): 17–31.
Geisler, W. S. & Cormack, L. (2011). Chapter 24: Models of Overt Attention, in The Oxford handbook of eye movements, Oxford University Press.
Hou, X. & Zhang, L. (2007). Saliency detection: A spectral residual approach, Proc. IEEE Conf. Computer Vision and Pattern Recognition CVPR ’07, pp. 1–8.
Itti, L., Koch, C. & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11): 1254–1259.
Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an ideal-observer model of reading, Vision Research pp. 2219–2234.
Mancas, M. (2009). Relative influence of bottom-up and top-down attention, Attention in Cognitive Systems, Vol. 5395 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg.
Oliva, A., Torralba, A., Castelhano, M. & Henderson, J. (2003). Top-down control of visual attention in object detection, Proc. 2003 International Conference on Image Processing (ICIP 2003), Vol. 1, pp. I-253–256.
Riche, N., Mancas, M., Duvinage, M., Mibulumukini, M., Gosselin, B. & Dutoit, T. (2013). Rare2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis, Signal Processing: Image Communication 28(6): 642–658.
Seo, H. J. & Milanfar, P. (2009). Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9(12).
Tatler, B. (2007). The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions, Journal of Vision 7.


How to measure attention?

There are many ways to measure attention. Some, mainly in psychology, are qualitative and rely on questionnaires and their interpretation. Others are quantitative but focus on the participants’ feedback (button press, click, etc.) when they see, hear or sense a stimulus.

Here we focus on quantitative techniques which provide fine-grained information about the attentive responses. The attentive response can be measured either directly in the brain or indirectly through the participants’ eye behavior. Only one of the techniques described here is based on active participant feedback: mouse tracking. It is included because mouse-tracking feedback is very close to that of eye-tracking, and because it is an emerging approach of interest for the future: it needs less time and less money, and provides more data, than classical eye-tracking.

Eye-tracking: an indirect cue about overt attention

The eye-tracker is probably the most widely used tool for attention measurement. The idea is to use a device which is able to precisely measure the eye gaze, which obviously only provides information about overt attention (attention that is accompanied by eye movements).

Eye-tracking technology has evolved considerably over time. Different technologies are described in [1]. One of the first techniques is EOG (Electro-OculoGraphy). The idea is to measure the electric potential of the skin around the eye, which gives the eye direction relative to the head. This implies that, for a complete eye-tracking system, either the head must be attached to a fixed support or a head tracker must be used in addition to the EOG. In order to get more precise results, special lenses can be used instead of EOG, but this technique is more invasive and it also only provides the eye direction relative to the head and not the eye gaze as an intersection with a screen, for example.

The technique used by most current commercial and research solutions is based on the video detection of the pupil and of a corneal reflection. An infra-red source sends light towards the eyes; the light is reflected by the eye and the position of the reflection is used to compute the gaze direction.

While the technique is the same most of the time, the embodiment of the eye-tracker can be very different. The main eye-tracking manufacturers offer their systems in different forms [2][3][4].

  1. Some eye-trackers are directly included into the screen which is used to present the data. This setup has the advantage of a very short calibration, but it can only be used with its own screen.
  2. Separate cameras need some additional calibration time but the tests can be done on any screen and even in a real scene by using a scene camera.
  3. The eye-tracking glasses can be used in a very ecological setup, even outside on real-life scenes. An issue of those systems is that it is not easy to aggregate the data from several viewers as the scene which is viewed is not the same. The aggregation needs a non-trivial registration of the scenes.
  4. Cheap devices are beginning to appear, and quite precise cameras are sold for less than 100 EUR [5], which is a fraction of the price of a professional eye-tracker. An issue with those eye-trackers is that they are sold with minimal software, and it is often difficult to synchronize the stimuli with the recorded data. Those eye-trackers are mostly used as real-time human-machine interaction devices. Nevertheless, open source projects such as Ogama [6] make it possible to record data from low-cost eye-trackers.
  5. Finally, webcam-based software is freely available [7]. It can provide good quality data and can be used remotely with existing webcams [8].

Mouse-tracking: the low-cost eye-tracking

Although eye tracking is the most reliable ground truth in the study of overt visual attention, it requires a well-trained operator, it imposes some constraints on the user (the head may have to be held still, and the calibration process may be long), and it needs a complex system which has a certain cost.

A much simpler way to acquire data about visual attention may be mouse tracking. While a web page is open in a browser, the mouse can be followed precisely using a client-side language like JavaScript. The precise position of the mouse on the screen can be captured either with custom code or with existing libraries like [9][10]. This technique may not appear very reliable; however, everything depends on the context of the experiment.

  1. The first case is the one where the experiment is hidden from the participants: they are unaware that the mouse motion is recorded. In this case the mouse motion is not accurate enough. Indeed, the hand does not automatically follow the eye gaze, even if a tendency of the hand (and consequently of the mouse) to follow the gaze is visible. For example, the mouse is sometimes only used to scroll a page while the eyes are very far from the mouse pointer.
  2. The second case is the one where the participants are aware of the experiment and have a task to follow. This can go from a simple “point the mouse where you look” instruction, as in [11] with the first use of mouse tracking for saliency evaluation, to more recent approaches such as SALICON [12], where multi-resolution interactive pointing mimicking the resolution of the fovea is used to push people to point the mouse cursor where they look.

In this second case, where the participants are aware that their mouse motion is tracked, the results of mouse tracking are very close to those of eye-tracking, as shown by Egner and Scheier on their website [13]. Some unconscious eye movements may be missed, but is this really an issue?

The main advantages of mouse tracking are its low price and its complete transparency for the users (they just move a mouse pointer).

However, mouse tracking has several drawbacks:

  • The first place where the mouse pointer is located is quite important as the observer may look for the pointer. Should it be located outside the image or in the centre of the image? Ideally, the pointer should initially appear randomly in the image to avoid introducing a bias of the initial position of the pointer.
  • Mouse tracking only highlights areas which are consciously important for the observer. This is more of a theoretical drawback since, in practice, one mostly tries to predict the consciously interesting regions.
  • The pointer hides the image region it overlaps, so the pointer position is never on the important areas but very close to them. This drawback may be partially eliminated by the low-pass filtering step performed after averaging over the whole set of observers. It is also possible to make the pointer transparent, as in [12].
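The averaging and low-pass filtering step mentioned in the last point can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing used in [11] or [12]: the pointer tracks are made-up `(x, y)` pixel positions, the filter is a separable Gaussian with an assumed `sigma`, and the function names are hypothetical.

```python
import math

def pointer_grid(tracks, width, height):
    """Mean hit count per pixel over all observers' pointer tracks."""
    grid = [[0.0] * width for _ in range(height)]
    for track in tracks:                 # one list of (x, y) per observer
        for x, y in track:
            if 0 <= x < width and 0 <= y < height:
                grid[y][x] += 1.0 / len(tracks)
    return grid

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def smooth(grid, sigma=1.0):
    """Separable Gaussian low-pass filter with zero padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(grid), len(grid[0])
    rows = [[sum(k[d + r] * grid[y][x + d]
                 for d in range(-r, r + 1) if 0 <= x + d < w)
             for x in range(w)] for y in range(h)]
    return [[sum(k[d + r] * rows[y + d][x]
                 for d in range(-r, r + 1) if 0 <= y + d < h)
             for x in range(w)] for y in range(h)]

# Two hypothetical observers pointing next to, but never on, pixel (5, 5).
tracks = [[(4, 5), (5, 4)], [(6, 5), (5, 6)]]
raw = pointer_grid(tracks, width=11, height=11)
heat = smooth(raw, sigma=1.0)
```

In this toy example no pointer position falls exactly on pixel (5, 5), yet after filtering that pixel holds the maximum of `heat`, which shows how the low-pass step recovers the area that the pointers surround.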

EEG: Get the electric activity from the brain

The EEG technique (ElectroEncephaloGraphy) uses electrodes placed on the participant’s scalp. Those electrodes pick up the electrical signals coming from the brain, which are then amplified. An issue with this technique is that the skull and scalp attenuate those signals.

While classical research setups have a high number of electrodes with manufacturers like [14][15], some low-cost commercial systems like Emotiv [16] are more compact and easy to install and calibrate. While the latter are easier to use, they are obviously less precise.

EEG studies have provided interesting results, such as the modulation of the gamma band during selective visual attention [17]. Other papers [18] also provide cues about alpha band modifications during attentional shifts.

One very important cue about attention which can be measured using EEG is the P300 event-related potential.

The work of Näätänen et al. [19] in 1978 on auditory attention provided evidence that the evoked potential has an enhanced negative response when the subject is presented with rare stimuli compared to frequent ones. This negative component is called the mismatch negativity (MMN), and it has been observed in several experiments. The MMN occurs 100 to 200 ms after the stimulus, a time which is well within the range of the pre-attentive phase.
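A common way to expose such a component is to average the epochs recorded for rare and for frequent stimuli and to subtract the two averages. The sketch below illustrates that difference-wave computation on synthetic data only: the sampling rate, the amplitudes and the helper names are assumptions and do not come from the cited studies.

```python
def average_erp(epochs):
    """Average equally long epochs (lists of voltages in microvolts)."""
    return [sum(samples) / len(epochs) for samples in zip(*epochs)]

def difference_wave(rare_epochs, frequent_epochs):
    """Rare-minus-frequent evoked potential, sample by sample."""
    rare = average_erp(rare_epochs)
    frequent = average_erp(frequent_epochs)
    return [r - f for r, f in zip(rare, frequent)]

def mean_in_window(wave, sample_rate_hz, start_ms, end_ms):
    """Mean amplitude between two latencies (inclusive), epoch starting at 0 ms."""
    first = int(start_ms * sample_rate_hz / 1000)
    last = int(end_ms * sample_rate_hz / 1000)
    window = wave[first:last + 1]
    return sum(window) / len(window)

SAMPLE_RATE_HZ = 100          # assumed: one sample every 10 ms

def make_epoch(amplitude, n_samples=31):
    """Synthetic 0-300 ms epoch with a deflection between 100 and 200 ms."""
    return [amplitude if 10 <= i <= 20 else 0.0 for i in range(n_samples)]

rare_epochs = [make_epoch(-1.5), make_epoch(-2.5)]      # made-up values
frequent_epochs = [make_epoch(0.0), make_epoch(0.0)]

mmn = difference_wave(rare_epochs, frequent_epochs)
mmn_amplitude = mean_in_window(mmn, SAMPLE_RATE_HZ, 100, 200)
baseline = mean_in_window(mmn, SAMPLE_RATE_HZ, 0, 90)
```

With these synthetic epochs the difference wave is negative in the 100 to 200 ms window and flat before it, which is the pattern described above for rare versus frequent stimuli.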

Depending on the experiments, different auditory features were isolated: audio frequency [20], audio intensity [19][21][22], spatial origin [23], duration [24] and phonetic changes [25]. None of these features was salient by itself; saliency was induced by the rarity of each of them.

The MMN signal for visual attention was studied several times in conjunction with auditory attention [26][27][28], but few experiments were made using only visual stimuli. In her thesis [29], Crottaz-Herbette conducted an experiment in the same conditions as for the auditory MMN in order to find out whether a visual MMN really exists. The result was clearly positive, with a strong increase of the negativity of the evoked potential when seeing rare stimuli compared with frequent stimuli. The visual MMN occurs from 120 to 200 ms after the stimulus. The 200 ms frontier interestingly matches the 200 ms needed to initiate a first eye movement, and thus to initiate the “attentive” serial attentional mechanism. As for the auditory MMN detection, no specific task was asked of the subjects, who only had to look at the stimuli; this MMN component is thus pre-attentive, unconscious and automatic.

This study and others [30] also suggest the presence of an MMN response for the somesthetic modality (touch, taste, etc.).

The MMN seems to be a universal component illustrating the brain reaction to an unconscious pre-attentive process. Any unknown stimulus (novel, rare) will be very salient as measured by P300 as the brain will try to know more about it. Rarity is the major engine of the attentional mechanism for visual, auditory and all the other signals acquired from our environment.

Functional imaging: fMRI

MRI stands for Magnetic Resonance Imaging. The main idea behind this kind of imaging system is that the human body is mainly made of water, whose molecules contain hydrogen atoms, each with a nucleus made of a single proton. Those protons have a magnetic moment (spin) which is randomly oriented most of the time. The MRI device sets up a very strong magnetic field which aligns the magnetic moments of the protons in the patient’s body. Radio-frequency (RF) pulses orthogonal to the initial magnetic field push the protons out of this alignment; the protons then align back with the initial magnetic field while releasing RF waves. Those waves are captured and used to construct an image where light gray levels mean that there are more protons, and therefore more water, in the body part (as in fat, for example), while darker gray levels reveal regions with less water, such as bones.

MRI is initially an anatomical imaging technique, but there is a functional version called fMRI which uses the BOLD (blood-oxygen-level-dependent) approach. In this case no substance needs to be injected: the contrast comes from the blood itself, because hemoglobin has different magnetic properties depending on whether or not it carries oxygen. If a body part or a region of the brain is in its basal activity state, the MRI signal stays at its baseline level. When a region of the brain is activated, the blood flow towards it increases and brings in more oxygenated blood, which changes the local magnetic response and increases the MRI signal. fMRI is thus capable of detecting the areas of the brain which are active, which makes it a great tool for neuroscientists, who can visualize which area of the brain is activated during an attention-related exercise.

Functional imaging: MEG

MEG stands for MagnetoEncephaloGraphy. The idea is simple: while EEG detects the electrical field, which is heavily distorted when traversing the skull and skin, MEG detects the magnetic field induced by this electrical activity. The magnetic field has the advantage of not being influenced by the skin or the skull. While the idea is simple, in practice the magnetic field is very weak, which makes it very difficult to measure. This is why MEG imaging is relatively new: technological advances based on SQUIDs (Superconducting Quantum Interference Devices) have made MEG more effective. The magnetic field of the brain can induce a current in a superconducting device, which can be precisely measured. Modern devices have spatial resolutions of 2 millimetres and temporal resolutions of a few milliseconds. Moreover, MEG images can be superimposed on anatomical MRI images, which helps to rapidly localise the main active areas. Finally, participants in MEG imaging can be seated, a position which is more natural during exercises than the horizontal position of fMRI or PET scans.

Functional imaging: PET Scan

Like fMRI, the PET scan (Positron Emission Tomography) is a functional imaging technique, and it therefore also produces a higher signal in case of brain activity. The main idea of the PET scan is that the substance which is injected into the patient releases positrons (anti-electrons, particles with the same properties as an electron but with a positive charge). Those positrons almost instantaneously meet an electron and undergo a very exo-energetic reaction called annihilation. This annihilation transforms the whole mass of the two particles into energy and releases two gamma photons in opposite directions, which are detected by the scanner sensors. The injected substance accumulates in the areas of the brain which are the most active, which means that those areas exhibit a high number of annihilations. As with fMRI, the PET scan lets neuroscientists know which areas of the brain are activated when the patient is performing an attention task.
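The energy released by one annihilation follows directly from E = mc², applied to the rest mass of the electron and of the positron. The short calculation below is a worked example under the simplifying assumption that both particles are at rest (their kinetic energy is neglected); the constants are the standard CODATA values and the variable names are illustrative.

```python
# Physical constants (CODATA values, SI units).
ELECTRON_MASS_KG = 9.1093837015e-31     # the positron has the same mass
SPEED_OF_LIGHT_M_S = 299_792_458.0
JOULES_PER_EV = 1.602176634e-19

# Rest energy of one particle: E = m * c^2.
rest_energy_j = ELECTRON_MASS_KG * SPEED_OF_LIGHT_M_S ** 2

# Both masses are converted into energy, shared by the two photons.
total_kev = 2 * rest_energy_j / JOULES_PER_EV / 1e3
photon_kev = total_kev / 2

print(f"total energy: {total_kev:.1f} keV, per photon: {photon_kev:.1f} keV")
```

Each of the two photons therefore carries about 511 keV, and the scanner looks for pairs of such photons arriving on opposite sides of the detector ring.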

Functional imaging and attention

Positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) have been extensively used to explore the functional neuroanatomy of cognitive functions. MEG imaging is beginning to be used in the field, as in [31]. In [32], a review of 275 PET and fMRI studies of attention, perception, visual attention, memory, language, etc. is presented. Depending on the setup and task, a large variety of brain regions seem to be involved in attention and related functions (language, memory). These findings again support the idea that, at the brain level, there are several attentions and that their activity is largely distributed across almost the whole brain. Attention goes from low-level to high-level processing, from reflexes to memory and emotions, and across all the human senses.


[1] Duchowski, Andrew. Eye tracking methodology: Theory and practice. Vol. 373. Springer Science & Business Media, 2007.

[2] Tobii eye tracking technology,

[3] SMI eye tracking technology,

[4] SR-Research eye tracking technology,

[5] Eyetribe low cost eye-trackers,

[6] Open source recording from several eye trackers,

[7] Open source eye-tracking for webcams,

[8] Xu, Pingmei, et al. “TurkerGaze: Crowdsourcing Saliency with Webcam based Eye Tracking.” arXiv preprint arXiv:1504.06755 (2015).

[9] Heatmapjs, javascript API,

[10] Simple Mouse Tracker,

[11] Mancas, Matei. “Relative influence of bottom-up and top-down attention.” Attention in cognitive systems. Springer Berlin Heidelberg, 2009. 212-226.

[12] Jiang, Ming, et al. “SALICON: Saliency in Context.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[13] Mediaanlyzer web site:

[14] Cadwell EEG,

[15] Natus EEG,

[16] Emotiv EEG,

[17] Müller, Matthias M., Thomas Gruber, and Andreas Keil. “Modulation of induced gamma band activity in the human EEG by attention and visual information processing.” International Journal of Psychophysiology 38.3 (2000): 283-299.

[18] Sauseng, Paul, et al. “A shift of visual spatial attention is selectively associated with human EEG alpha activity.” European Journal of Neuroscience 22.11 (2005): 2917-2926.

[19] Näätänen, R., Gaillard, A.W.K., and Mäntysalo, S., “Early selective-attention effect on evoked potential reinterpreted”, Acta Psychologica, 42, 313-329, 1978

[20] Sams, H., Paavilainen, P., Alho, K., and Näätänen, R., “Auditory frequency discrimination and event-related potentials”, Electroencephalography and Clinical Neurophysiology, 62, 437-448, 1985

[21] Näätänen, R., and Picton, T., “The N1 wave of the human electric and magnetic response to sound: a review and analysis of the component structure”, Psychophysiology, 24, 375-425, 1987

[22] Paavilainen, P., Alho, K., Reinikainen, K., Sams, M., and Näätänen, R., “Right hemisphere dominance of different mismatch negativities”, Electroencephalography and Clinical Neurophysiology, 78, 466-479, 1991

[23] Paavilainen, P., Karlsson, M.L., Reinikainen, K., and Näätänen, R., “Mismatch Negativity to change in spatial location of an auditory stimulus”, Electroencephalography and Clinical Neurophysiology, 73, 129-141, 1989

[24] Paavilainen, P., Jiang, D., Lavikainen, J., and Näätänen, R., “Stimulus duration and the sensory memory trace: An event-related potential study”, Biological Psychology, 35 (2), 139-152, 1993

[25] Aaltonen, O., Niemi, P., Nyrke, T., and Tuhkahnen, J.M., “Event-related brain potentials and the perception of a phonetic continuum”, Biological psychology, 24, 197-207, 1987

[26] Neville, H.J., and Lawson, D., “Attention to central and peripheral visual space in a movement detection task: an event-related potential and behavioral study. I. Normal hearing adults”, Brain Research, 405, 253-267, 1987

[27] Czigler, I., and Csibra, G., “Event-related potentials in a visual discrimination task: Negative waves related to detection and attention”, Psychophysiology, 27 (6), 669-676, 1990

[28] Alho, K., Woods, D.L., Alagazi, A., and Näätänen, R., “Intermodal selective attention. II. Effects of attentional load on processing of auditory and visual stimuli in central space”, Electroencephalography and Clinical Neurophysiology, 82, 356-368, 1992

[29] Crottaz-Herbette, S., “Attention spatiale auditive et visuelle chez des patients héminégligents et des sujets normaux : étude clinique, comportementale et électrophysiologique”, PhD Thesis, University of Geneva, Switzerland, 2001

[30] Desmedt, J.E., and Tomberg, C., “Mapping early somatosensory evoked potentials in selective attention: Critical evaluation of control conditions used for titrating by difference the cognitive P30, P40, P100 and N140”, Electroencephalography and Clinical Neurophysiology, 74, 321-346, 1989

[31] Downing, Paul, Jia Liu, and Nancy Kanwisher. “Testing cognitive models of visual attention with fMRI and MEG.” Neuropsychologia 39.12 (2001): 1329-1342.

[32] Cabeza, Roberto, and Lars Nyberg. “Imaging cognition II: An empirical review of 275 PET and fMRI studies.” Journal of cognitive neuroscience 12.1 (2000): 1-47.


What is attention? – Part 2: From neuroscience to computer science

Attention: the technology comes in

After the “crisis” of the 1980s in attention research, two different communities appeared in the study of attention, with the arrival of tools providing new insights into brain behavior and with the increasing power of computers. One community deals with cognitive neuroscience and intends, along with cognitive psychology, to understand the deep mechanisms of attention, while the other community focuses on engineering and computer science, and its goal is to develop attention models to be applied in signal processing and especially in image processing (Figure 1).

Fig. 1 Attention history: an accumulation of domains in onion layers

The arrival of new techniques and computational capacities brought fresh air (and results) in the study of attention.

Attention in cognitive neuroscience

Cognitive neuroscience arrived with a whole set of new tools and methods. While some of them were already used in cognitive psychology (EEG, eye-tracking devices, etc.), others are new tools providing new insights into brain behavior:

  • Psychophysical methods: scalp recording of EEG (electroencephalography, which measures the electric activity of the neurons) and MEG (magnetoencephalography, which measures the magnetic activity of the neurons), which are complementary in terms of sensitivity to different brain areas of interest.
  • Neuroimaging methods: functional MRI and PET scan images, which both measure the areas in the brain which have intense activity given a task that the subject executes (visual, audio …).
  • Electrophysiological methods: single-cell recordings which measure the electro-physiological responses of a single neuron using a microelectrode system. While this system is much more precise, it is also more invasive.
  • Other methods: TMS (transcranial magnetic stimulation, which can be used to stimulate a region of the brain and to measure the activity of specific brain circuits in humans) and multi-electrode technology, which allows the study of the activity of many neurons simultaneously, showing how different neuron populations interact and collaborate.

Using those techniques, two main families of theories arose.

The first and best-known model is the one by Desimone and Duncan on biased competition [1]. The central idea is that, at any given moment, there is more information in the environment than can be processed. Relevant information always competes with irrelevant information to influence behavior. Attention biases this competition, increasing the influence of behavior-relevant information and decreasing the influence of irrelevant information. Desimone and Duncan explicitly suggest a physiologically plausible neural basis that mediates this competition for the visual system. The receptive field of a neuron is a window to the outside world: the neuron reacts only to stimuli in this window and is insensitive to stimulation in other areas. The authors assume that competition between stimuli takes place if more than one stimulus shares the same receptive field. This approach is very interesting, as each neuron can be seen as a filter by itself, and the receptive fields of neurons range from very small and precise (as in the visual cortex V1) to very large ones which cover entire objects (as in the IT brain area). This basic idea is consistent with different approaches to attention (location-based, feature-based, object-based, attentional bottleneck) in a very natural and elegant way. Moreover, a link is made with memory through the notion of attentional templates in working memory, which enhance the response of neurons depending on previously acquired data.

A second family of models was set up by Laberge in the late 1990s [2]. It is a structural model based on neuropsychological findings and data from neuroimaging studies. Laberge conjectures that at least three brain regions are concurrently involved in the control of attention: frontal areas, especially the prefrontal cortex; thalamic nuclei, especially the pulvinar; and posterior sites, namely the posterior parietal cortex and the interparietal sulcus. Laberge proposes that these regions are necessary for attention and that all these regions presumably give rise to attentional control together.

While cognitive neuroscience brought a lot of new information to cognitive psychology, the attention process is still far from being fully understood and a lot of work is ongoing in the field.

Attention in computer science

While cognitive neuroscience focuses on researching the very nature of attention, a different angle was taken from the 1980s onwards with the growth of computational power. Building on Treisman and Gelade's feature integration theory [3], C. Koch and S. Ullman [4] proposed that the different visual features that contribute to the attentive selection of a stimulus (color, orientation, movement, etc.) are combined into one single topographic map, called the “saliency map”. This map integrates the normalized information from the individual feature maps into one global measure. Bottom-up saliency is determined by how different a stimulus is from its surround at several scales. The saliency map provides, for each region in the visual field, the probability of being attended. This saliency map concept is close to that of the “master map” postulated in the feature integration theory of Treisman and Gelade.
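The integration step described above can be illustrated with a minimal sketch. It assumes the feature maps are already computed and simply rescales each one to [0, 1] before averaging them; the function names, the min-max normalization and the plain average are illustrative assumptions, not the scheme proposed by Koch and Ullman.

```python
from typing import List

Grid = List[List[float]]


def normalize(feature_map: Grid) -> Grid:
    """Rescale a feature map linearly to the range [0, 1]."""
    values = [v for row in feature_map for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat map carries no contrast, so it contributes nothing.
        return [[0.0 for _ in row] for row in feature_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in feature_map]


def saliency_map(feature_maps: List[Grid]) -> Grid:
    """Average the normalized feature maps into one topographic map."""
    normalized = [normalize(m) for m in feature_maps]
    rows, cols = len(normalized[0]), len(normalized[0][0])
    return [
        [sum(m[r][c] for m in normalized) / len(normalized) for c in range(cols)]
        for r in range(rows)
    ]


# Two hypothetical 3x3 feature maps (say, color contrast and orientation contrast).
color = [[0.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 0.0]]
orientation = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
smap = saliency_map([color, orientation])
```

In this toy case the center location, which stands out in both maps, receives the highest combined value, while the bottom-right corner stands out in only one map and scores lower.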

The first computational implementation of the Koch and Ullman architecture was achieved by Laurent Itti in his seminal work [5]. This very first computational implementation of an attention system takes any image as input and outputs a saliency map of this image, together with a winner-take-all mechanism that simulates the eye fixations during selective attention. From that point on, hundreds of models were developed, first for images, then for videos and, very recently, some of them for audio or even 3D data.
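A winner-take-all selection of successive fixations can be sketched as follows. This is a simplified illustration, not Itti's implementation: it repeatedly picks the most salient location and then zeroes a square neighborhood around it, a crude stand-in for inhibition of return (assumed here so that the same location is not selected forever). The function name `wta_fixations` and the `radius` parameter are hypothetical.

```python
from typing import List, Tuple


def wta_fixations(
    saliency: List[List[float]], n_fixations: int, radius: int = 1
) -> List[Tuple[int, int]]:
    """Return successive (row, col) fixations chosen by winner-take-all."""
    smap = [row[:] for row in saliency]  # work on a copy of the map
    rows, cols = len(smap), len(smap[0])
    fixations = []
    for _ in range(n_fixations):
        # Winner-take-all: the most salient location wins the competition.
        r, c = max(
            ((i, j) for i in range(rows) for j in range(cols)),
            key=lambda rc: smap[rc[0]][rc[1]],
        )
        fixations.append((r, c))
        # Assumed inhibition of return: suppress the winner's neighborhood.
        for i in range(max(0, r - radius), min(rows, r + radius + 1)):
            for j in range(max(0, c - radius), min(cols, c + radius + 1)):
                smap[i][j] = 0.0
    return fixations


demo = [
    [0.1, 0.2, 0.0, 0.0],
    [0.2, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.3, 0.7],
]
path = wta_fixations(demo, n_fixations=2)  # [(1, 1), (3, 3)]
```

The first fixation lands on the global maximum; once its neighborhood is suppressed, the next most salient region wins.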

After the initial biologically-inspired models, many models based on mathematics, statistics or information theory arrived on the “saliency market”, predicting human attention better and better. They are all based on features extracted from the signal (most of the time low-level features, but not always), such as luminance, color, orientation, texture, motion, the relative position of objects, or even simply neighborhoods or patches from the signal. Once those features are extracted, all the existing methods are essentially based on the same principle: looking for “contrasted, rare, surprising, novel, worthy-to-learn, less compressible, maximizing the information” areas. All those words are practically synonyms here, and they all amount to searching for unusual features in a given context. This context can be local (typically center-surround spatial or temporal contrasts), global (the whole image or a very long temporal history), or it can be a model of normality (the image average, the image frequency content). Very recently, learning has become more and more involved in saliency computation: at first it was mainly about adjusting model coefficients for a precise task, whereas now complex classifiers such as deep neural networks are beginning to be used both to extract the features from the signal and to learn which features are the most salient, based on ground truth obtained with eye-tracking or mouse-tracking data.
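As one concrete reading of “rare” in a global context, the sketch below scores each location by the self-information -log2(p) of its feature value, where p is the frequency of that value over the whole image. This is an illustrative choice among the many formulations mentioned above, and the quantized feature grid is a made-up example.

```python
import math
from collections import Counter
from typing import List


def rarity_map(features: List[List[int]]) -> List[List[float]]:
    """Score each location by the self-information of its feature value.

    The context is global: probabilities are estimated over the whole grid.
    """
    values = [v for row in features for v in row]
    counts = Counter(values)
    total = len(values)
    return [[-math.log2(counts[v] / total) for v in row] for row in features]


# Hypothetical quantized color feature: 0 = blue, 1 = red.
dots = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rarity = rarity_map(dots)
```

The single red location occurs with probability 1/16 and therefore carries 4 bits, while each blue location carries less than 0.1 bit.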

So … what is attention?

The trans-domain nature of attention naturally led to a lot of different definitions. Attention deals with the allocation of cognitive resources to important incoming information in order to bring it to a conscious state, update a scene model, update the memory and influence behavior. But several attention mechanisms were highlighted, especially following Cherry's cocktail party problem. A dichotomy appeared between divided attention and selective attention. From there, a clinical model of attention divided into five different “kinds” appeared. One can also distinguish different kinds of attention depending on whether it needs the eye focus or not, or whether it uses only the image features or also memory and emotions… While its purpose seems to be the relation between the outer world and inner consciousness, memory and emotions, the clinical manifestation of attention tends to show that there might be several attentions.

Overt vs. covert: the eye

Overt versus covert attention is a property of attention that was identified at the very beginning of the psychological studies of attention. Overt attention is the one that is revealed by eye activity or, more generally, by the focus of attention. Covert attention does not induce eye movements or a specific focus: it is the ability to catch (and thus be able to bring to consciousness) regions of an image which are not fixated by the eyes. Eye movements result from the non-uniform distribution of receptive cells (cones and rods) on the retina. The cones, which provide high resolution and color, are mainly concentrated in the middle of the retina in a region called the “fovea”. This means that, in order to acquire an area of an image at a good spatial resolution, the eye must gaze towards this precise area to align it with the fovea. This constraint leads to three main types of eye movements:

  1. Fixations: the gaze stays for a minimal period of time on approximately the same spatial area. The eye gaze is never completely still: even when gazing at a specific location, micro-saccades can be detected. Micro-saccades are very small movements of the eye during the fixation of an area.
  2. Saccades: the eyes make a ballistic movement between two fixations. They disengage from one fixation and are very rapidly shifted to the next one. Between the two fixations, no visual data is acquired.
  3. Smooth pursuit: a smooth pursuit is a fixation … on a moving object. The eye follows a moving object to keep it in the fovea (the central part of the retina). During smooth pursuit, small and more abrupt corrections can occur when the image of the object drifts on the retina. This smooth pursuit with small corrections is called a nystagmus.

Modelling overt attention amounts to predicting human fixations and the dynamic path of the eye (called the eye “scanpath”).
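The distinction between fixations and saccades can be made operational on eye-tracker data with a simple velocity-threshold rule (often called I-VT in the eye-tracking literature). The sketch below is a hedged illustration: the sampling period, the threshold of 100 degrees per second and the `min_samples` filter are assumed values chosen for the example, not values taken from this text.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def label_intervals(gaze: List[Point], dt: float, threshold: float) -> List[str]:
    """Label each interval between consecutive samples by gaze velocity.

    gaze: positions in degrees of visual angle, sampled every dt seconds.
    threshold: velocity threshold in degrees per second (assumed value).
    """
    labels = []
    for (x0, y0), (x1, y1) in zip(gaze, gaze[1:]):
        velocity = math.hypot(x1 - x0, y1 - y0) / dt
        labels.append("saccade" if velocity > threshold else "fixation")
    return labels


def fixation_centers(
    gaze: List[Point], labels: List[str], min_samples: int = 2
) -> List[Point]:
    """Group samples separated by saccades and return each group's mean."""
    groups = [[gaze[0]]]
    for point, label in zip(gaze[1:], labels):
        if label == "saccade":
            groups.append([point])  # a saccade ends the current fixation
        else:
            groups[-1].append(point)
    kept = [g for g in groups if len(g) >= min_samples]
    return [
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        for g in kept
    ]


# Hypothetical 100 Hz recording: jitter around (1, 1), a jump, jitter around (8, 5).
gaze = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (8.0, 5.0), (8.1, 5.0), (8.0, 5.1)]
labels = label_intervals(gaze, dt=0.01, threshold=100.0)
scanpath = fixation_centers(gaze, labels)
```

The ordered list of fixation centers is a minimal representation of a scanpath; smooth pursuit is not handled by this rule and would need a separate treatment.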

Serial vs. parallel: the cognitive load

While focused, sustained and selective attention deal with a serial processing of information, alternating and divided attention deal with the parallel processing of several tasks. These facts show that attention can deal with information both serially and in parallel. While there is a limit to the number of tasks that can be processed in parallel during divided attention (around 5 tasks), pre-attentive processing allows massively parallel computation. Some notions, such as the gist [6], seem to be very fast and able to process (very roughly) the entire visual field to get a first idea of the context of the environment. The five kinds of attention follow a hierarchy based on the degree of focus, and thus on the cognitive load needed to achieve the attentive task. This approach is sometimes called the clinical model of attention.

  1. Focused attention: respond to specific stimuli (focus on a precise task).
  2. Sustained attention: keep a consistent response during longer continuous activity (stay attentive for a long period of time and follow the same topic).
  3. Selective attention: selectively maintain the cognitive resource on specific stimuli (focus only on a given object while ignoring distractors).
  4. Alternating attention: switch between multiple tasks (stop reading to watch something).
  5. Divided attention: deal simultaneously with multiple tasks (talking while driving).

Bottom-up vs. top-down: the memory and actions

Another fundamental property of attention needs to be taken into account: attention is a mix of two components called bottom-up (or exogenous) and top-down (or endogenous). The bottom-up component is reflex-based and uses the acquired signal. Attention is attracted by the novelty of some features in a given context (spatial local: there is a contrasted region; spatial global: there is a red dot while all the others are blue; temporal: there is a slow motion while the motion was fast before…). Its main purpose is to alert in case of unexpected or rare situations, and it is tightly related to survival. This first component of attention is the one that is best modeled in computer science, as the signal features are objective cues which can easily be extracted computationally.

The second component of attention (top-down) deals with individual subjective feelings. It is related to memory, emotions and individual goals. This component of attention is less easy to model computationally, as it is more subjective and requires cues about individual goals, a priori knowledge or emotions. Top-down attention can itself be divided into two sub-components:

  1. Goal/Action-related: depending on an individual's current goal, certain features or locations are inhibited and others receive more weight. The same individual with the same prior knowledge responds differently to the same stimuli when the task at hand is different. This component is sometimes called “volitional”.
  2. Memory/Emotion-related: this process is related to experience and prior knowledge (and the emotions related to them). In this category one can find the scene context (experience from previously viewed scenes with similar spatial layouts or similar motion behavior) or object recognition (you see your grandmother first in the middle of other people). This component of attention is more “automatic”: it does not require a large cognitive load and it can come in addition to volitional attention. Conversely, volitional top-down attention cannot inhibit memory-related attention, which keeps working whether or not a goal is present. More generally, bottom-up attention cannot be inhibited when a strong and unusual signal is acquired. If someone searches for his keys (volitional top-down), he will not pay attention to a car passing by. But if he hears a strange sound (bottom-up) and then recognizes a lion (memory-related top-down attention), he will stop searching for the keys and run away … Volitional top-down attention is able to inhibit the other components of attention only if the signals they carry are not very important.
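The interplay between a volitional goal and bottom-up signals can be sketched with goal-dependent weights applied to feature maps. The linear weighting, the feature names and all numeric values below are assumptions made for illustration only; they are one simple way to express that a goal down-weights irrelevant features without being able to silence a very strong signal.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]


def modulated_saliency(feature_maps: Dict[str, Grid], weights: Dict[str, float]) -> Grid:
    """Weighted sum of feature maps; a small weight inhibits a feature."""
    names = list(feature_maps)
    first = feature_maps[names[0]]
    rows, cols = len(first), len(first[0])
    return [
        [
            sum(weights.get(n, 1.0) * feature_maps[n][r][c] for n in names)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def most_salient(smap: Grid) -> Tuple[int, int]:
    """Return the (row, col) of the highest value in the map."""
    return max(
        ((r, c) for r in range(len(smap)) for c in range(len(smap[0]))),
        key=lambda rc: smap[rc[0]][rc[1]],
    )


# Hypothetical task "find the keys": key-like color is favored, motion is inhibited.
task_weights = {"key_color": 1.0, "motion": 0.2}
keys = [[0.0, 0.6, 0.0]]
faint_motion = [[0.5, 0.0, 0.0]]   # e.g. a car passing by
strong_motion = [[0.0, 0.0, 5.0]]  # e.g. a sudden, unusual event

calm = most_salient(modulated_saliency({"key_color": keys, "motion": faint_motion}, task_weights))
alarm = most_salient(modulated_saliency({"key_color": keys, "motion": strong_motion}, task_weights))
```

With the weak distractor the goal-relevant location wins; with the strong signal the inhibited feature still dominates, mirroring the keys-and-lion example above.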

Attention vs. attentions: a summary

The study of attention spans disciplines ranging from philosophy to computer science, by way of psychology and neuroscience. Those disciplines sometimes study different aspects or views of attention, which means that giving a single, precise definition of attention is simply not feasible.

To sum up the different approaches, attention is about:

  • eye/neck mechanics and outside world information acquisition: the attentional “embodiment” leads to parallel and serial attention (overt versus covert attention)
  • allocation of cognitive resources to important incoming information: the attentional “filtering” is the first step towards data structuring (degree of focus and clinical model of attention)
  • mutual influence on memory and emotions: passing important information to a conscious state and getting feedback from memory and emotions (bottom-up and memory-related top-down attention)
  • behavior update: react to novel situations but also manage the goals and actions (bottom-up and volitional top-down attention)

Attention plays a crucial role from signal acquisition to action planning going through the main cognitive steps… or maybe there are simply several attentions and not only one. At this point this question still has no final answer.


[1] Desimone, Robert, and John Duncan. “Neural mechanisms of selective visual attention.” Annual review of neuroscience 18.1 (1995): 193-222.

[2] Laberge (1999). Networks of Attention. In: Gazzaniga, Michael S., ed. The cognitive neurosciences. MIT press, 2004.

[3] Treisman, Anne M., and Garry Gelade. “A feature-integration theory of attention.” Cognitive psychology 12.1 (1980): 97-136.

[4] Koch, Christof, and Shimon Ullman. “Shifts in selective visual attention: towards the underlying neural circuitry.” Matters of intelligence. Springer Netherlands, 1987. 115-141.

[5] Itti, Laurent, Christof Koch, and Ernst Niebur. “A model of saliency-based visual attention for rapid scene analysis.” IEEE Transactions on pattern analysis and machine intelligence 20.11 (1998): 1254-1259.

[6] Torralba, Antonio, et al. “Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.” Psychological review 113.4 (2006): 766.


Why should computers be attentive?

Any animal [1] from the tiniest insect [2] to humans is perfectly able to “pay attention”. Attention is the first step of perception: it analyses the outer real world and turns it into an inner conscious representation. Even during some dreaming phases known as REM (Rapid Eye Movements), the eye activity proves that the attentional mechanism is at work. But this time it analyses a virtual world coming from the inner subconscious and turns it into an inner conscious representation. Attention seems to be not only the first step of perception, but also the gate to conscious awareness.

The attentional process probably becomes active with the first developments of a complex sense (such as hearing), which come with the first REM dreams beginning after the sixth month of foetal development [3]. This mechanism is one of the first cognitive processes to be set up, and factors like smoking, drugs, alcohol or even stress during pregnancy are associated with later attention disorders and even higher chances of developing psychopathologies [4][5]. It is well established, mainly through eye-tracking traces which can be very different between patients and control groups, that the attentive process is highly affected in cognitive psychopathologies (such as autism or schizophrenia) [6][7]. The attentive process is set up as early as the prenatal period, when it already begins to operate during babies' dreams. Until death, it operates at every single moment of the day when people are awake, but also during their dreams. This shows the importance of attention: it cannot be dissociated from perception and consciousness. Even when a person is sleeping without dreaming and the eyes are not moving, important stimuli can “wake up” that person. Attention is never turned off; it can only be lowered and put on standby (except in drug-induced states where consciousness is altered or eliminated, as in an artificial coma). It is thus safe to say that if there is conscious life in a body capable of acting on its environment, there is attention.

As a gate to conscious awareness at the interface between inner and outer worlds, attention can be both conscious (attentive) and unconscious (pre-attentive), and it is the key to survival. Attention is also a sign of limited computation capabilities. Vision, audition, touch, smell and taste all provide the brain with a huge amount of information. Gigabits of raw sensory data flow every second into the brain, which cannot physically handle such an information rate. Attention provides the brain with the capacity to select the main information and to prioritize tasks. While there are a lot of definitions and views of attention, the one core idea which justifies attention regardless of the discipline, methodology or intuition is “information reduction” [8].

Attention only began to be seriously studied in the 19th century with the arrival of modern psychology. Some thoughts about the concept of attention may be found in Descartes, but no rigorous and intensive scientific study was done until the beginning of psychology. How did philosophers miss such a key concept as attention from ancient times until almost now? Part of the answer is given by William James, often called the father of American psychology, in his famous definition of attention: “Every one knows what attention is”. Attention is so natural, so linked to life and partly unconscious, so obvious that … nobody really noticed it until recently.

However, little by little, a new transversal research field appeared around the concept of “attention”, gathering first psychologists, then neuroscientists and, since the end of the nineties, even engineers and computer scientists. While covering the whole of the research on attention would need a whole series of books, the topic is narrowed here to attention modelling, a crucial step towards wider artificial intelligence.

Indeed, this key process of attention is currently rarely used within computers. Like the brain, a computer is a processing unit. Like the brain, it has limited computation capabilities and memory. Like the brain, computers have to analyse more and more data. But unlike the brain, they do not pay attention. While a classical computer will be more precise in quantifying the whole input data, an attentive computer will focus on the most “interesting” data, which has several advantages:

  • It will be faster and more efficient in terms of memory storage due to its ability to process only part of the input data.
  • It will be able to find regularities and irregularities in the input signal and thus be able to detect and react to unexpected or abnormal events.
  • It will be able to optimize data prediction by describing novel patterns, and depending on the information reduction result (how efficient the information reduction was), it will be capable of being curious, bored or annoyed. This curiosity which constantly pushes to the discovery of more and more complex patterns to better reduce information is a first step towards creativity.

Just as attention is the gate to awareness and consciousness in humans, attention in computers can lead to novel emergent computational paradigms beyond classical pre-programmed machines. While the road towards self-modifying computers is still very long, computational attention is developing rapidly, letting more and more applications benefit from it.


[1] Zentall, Thomas R. “Selective and divided attention in animals.” Behavioural Processes 69.1 (2005): 1-15.
[2] Hoy, Ronald R. “Startle, categorical response, and attention in acoustic behavior of insects.” Annual review of neuroscience 12.1 (1989): 355-375.
[3] Hopson, Janet L. “Fetal psychology.” Psychology Today 31.5 (1998): 44.
[4] Mick, Eric, et al. “Case-control study of attention-deficit hyperactivity disorder and maternal smoking, alcohol use, and drug use during pregnancy.” Journal of the American Academy of Child & Adolescent Psychiatry 41.4 (2002): 378-385.
[5] Linnet, Karen Markussen, et al. “Maternal lifestyle factors in pregnancy risk of attention deficit hyperactivity disorder and associated behaviors: review of the current evidence.” American Journal of Psychiatry 160.6 (2003): 1028-1040.
[6] Holzman, Philip S., et al. “Eye-tracking dysfunctions in schizophrenic patients and their relatives.” Archives of general psychiatry 31.2 (1974): 143-151.
[7] Klin, Ami, et al. “Visual fixation patterns during viewing of naturalistic social situations as predictors of social competence in individuals with autism.” Archives of general psychiatry 59.9 (2002): 809-816.
[8] Itti, Laurent, Geraint Rees, and John K. Tsotsos, eds. Neurobiology of attention. Academic Press, 2005.