What is attention? – Part 2: From neuroscience to computer science

Attention: the technology comes in

After the “crisis” of the 1980s in attention research, the arrival of tools providing new insights into brain behavior and the increasing power of computers gave rise to two different communities in the study of attention. One community works in cognitive neuroscience and aims, along with cognitive psychology, to understand the deep mechanisms of attention. The other focuses on engineering and computer science, and its goal is to develop attention models to be applied in signal processing, and especially in image processing (Figure 1).

Fig. 1 Attention history: an accumulation of domains in onion layers

The arrival of new techniques and computational capacities brought fresh air (and results) to the study of attention.

Attention in cognitive neuroscience

Cognitive neuroscience arrived with a whole set of new tools and methods. While some of them were already used in cognitive psychology (EEG, eye-tracking devices …), others are new tools providing new insights into brain behavior:

Psychophysical methods: scalp recording of EEG (electroencephalography, which measures the electrical activity of the neurons) and MEG (magnetoencephalography, which measures the magnetic activity of the neurons), which are complementary in terms of sensitivity to different brain areas of interest.
Neuroimaging methods: functional MRI and PET scan images, which both measure the areas in the brain which have intense activity given a task that the subject executes (visual, audio …).
Electrophysiological methods: single-cell recordings which measure the electro-physiological responses of a single neuron using a microelectrode system. While this system is much more precise, it is also more invasive.
Other methods: TMS (transcranial magnetic stimulation, which can be used to stimulate a region of the brain and to measure the activity of specific brain circuits in humans) and multi-electrode technology, which allows the study of the activity of many neurons simultaneously, showing how different neuron populations interact and collaborate.

Using those techniques, two main families of theories arose.

The first and best-known model is the biased competition model of Desimone and Duncan [1]. The central idea is that at any given moment, there is more information in the environment than can be processed. Relevant information always competes with irrelevant information to influence behavior. Attention biases this competition, increasing the influence of behavior-relevant information and decreasing the influence of irrelevant information. Desimone and Duncan explicitly suggest a physiologically plausible neural basis that mediates this competition for the visual system. The receptive field of a neuron is a window to the outside world: the neuron reacts only to stimuli in this window and is insensitive to stimulation in other areas. The authors assume that competition between stimuli takes place if more than one stimulus shares the same receptive field. This approach is very interesting because each neuron can be seen as a filter by itself, and the receptive fields of neurons range from very small and precise (as in the visual cortex V1) to very large ones which cover entire objects (as in the IT brain area). This basic idea is consistent with different approaches to attention (location-based, feature-based, object-based, attentional bottleneck) in a very natural and elegant way. Moreover, a link is made with memory through the notion of attentional templates in working memory, which enhance the response of neurons depending on previously acquired data.

A second family of models was set up by Laberge in the late 1990s [2]. It is a structural model based on neuropsychological findings and data from neuroimaging studies. Laberge conjectures that at least three brain regions are concurrently involved in the control of attention: frontal areas, especially the prefrontal cortex; thalamic nuclei, especially the pulvinar; and posterior sites, namely the posterior parietal cortex and the intraparietal sulcus. Laberge proposes that these regions are necessary for attention and that, together, they presumably give rise to attentional control.

While cognitive neuroscience has brought a lot of new information to cognitive psychology, the attention process is still far from being fully understood and a lot of work is ongoing in the field.
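As a rough illustration of the biased competition idea of Desimone and Duncan described in this section, the response of a neuron to two stimuli sharing its receptive field can be sketched as a weighted average of its responses to each stimulus alone, with attention increasing the weight of the attended stimulus. The function name, the weighted-average form and the numeric values below are assumptions made for this toy example; they are not taken from [1].

```python
def biased_response(resp_a: float, resp_b: float,
                    bias_a: float = 1.0, bias_b: float = 1.0) -> float:
    """Toy response of one neuron to two stimuli inside its receptive field.

    resp_a and resp_b are the (hypothetical) firing rates evoked by each
    stimulus alone. bias_a and bias_b are attentional weights: a larger
    weight lets that stimulus win more of the competition.
    """
    return (bias_a * resp_a + bias_b * resp_b) / (bias_a + bias_b)


# Hypothetical firing rates: stimulus A is preferred, stimulus B is not.
no_attention = biased_response(40.0, 10.0)              # both compete equally
attend_a = biased_response(40.0, 10.0, bias_a=3.0)      # competition biased to A
attend_b = biased_response(40.0, 10.0, bias_b=3.0)      # competition biased to B
```

With these made-up numbers, attending to the preferred stimulus pulls the response towards the rate it evokes alone, and attending to the other stimulus pulls it the other way, which is the qualitative behavior the model describes.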

Attention in computer science

While cognitive neuroscience focuses on researching the very nature of attention, a different angle emerged in the 1980s with the development of computational power. Building on Treisman and Gelade's feature integration theory [3], C. Koch and S. Ullman [4] proposed that the different visual features that contribute to attentive selection of a stimulus (color, orientation, movement, etc.) are combined into one single topographic map, called the “saliency map”. This map integrates the normalized information from the individual feature maps into one global measure. Bottom-up saliency is determined by how different a stimulus is from its surround at several scales. The saliency map provides, for each region in the visual field, the probability of being attended. This saliency map concept is close to that of the “master map” postulated in the feature integration theory by Treisman and Gelade.
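The combination of normalized feature maps into a single saliency map can be sketched in a few lines. This is only a minimal illustration: the min-max normalization, the plain average and the toy values are assumptions of this example, and the models discussed here use more elaborate normalization and multi-scale processing.

```python
from typing import List

Map = List[List[float]]


def normalize(feature_map: Map) -> Map:
    """Rescale a feature map to the [0, 1] range (all zeros if it is flat)."""
    values = [v for row in feature_map for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in feature_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in feature_map]


def saliency_map(feature_maps: List[Map]) -> Map:
    """Average the normalized feature maps into one topographic map."""
    normalized = [normalize(m) for m in feature_maps]
    rows, cols = len(normalized[0]), len(normalized[0][0])
    return [
        [sum(m[r][c] for m in normalized) / len(normalized) for c in range(cols)]
        for r in range(rows)
    ]


# Toy 2x3 "color" and "orientation" maps (made-up values).
color = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]]
orientation = [[1.0, 1.0, 1.0], [1.0, 3.0, 1.0]]
saliency = saliency_map([color, orientation])
```

Here the location that stands out in both feature maps (row 1, column 1) receives the highest saliency, while a location that stands out in only one map receives an intermediate value.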

The first computational implementation of the Koch and Ullman architecture was achieved by Laurent Itti in his seminal work [5]. This very first computational implementation of an attention system takes any image as input and outputs a saliency map of this image, together with a winner-take-all mechanism simulating the eye fixations during selective attention. From that point on, hundreds of models were developed, first for images, then for videos, and some of them, very recently, for audio or even 3D data.
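The winner-take-all selection of successive fixations on a saliency map can be sketched as follows. The most salient location wins, and its neighborhood is then suppressed (inhibition of return) so that the focus can move on. The square neighborhood, the `radius` parameter and the toy map are assumptions of this sketch, which is far simpler than the neural implementation in [5].

```python
from typing import List, Tuple


def scanpath(saliency: List[List[float]], n_fixations: int,
             radius: int = 1) -> List[Tuple[int, int]]:
    """Pick successive fixations by winner-take-all with inhibition of return."""
    working = [row[:] for row in saliency]  # copy so the input is untouched
    rows, cols = len(working), len(working[0])
    path = []
    for _ in range(n_fixations):
        # Winner-take-all: the most salient location wins the competition.
        r, c = max(((i, j) for i in range(rows) for j in range(cols)),
                   key=lambda ij: working[ij[0]][ij[1]])
        path.append((r, c))
        # Inhibition of return: suppress the winner's square neighborhood.
        for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
            for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                working[rr][cc] = float("-inf")
    return path


# Toy saliency map with a strong peak on the left and a weaker one on the right.
toy_map = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.2, 0.0, 0.6],
    [0.1, 0.2, 0.1, 0.0, 0.3],
]
fixations = scanpath(toy_map, n_fixations=2)
```

The first fixation lands on the strongest peak and the second on the next most salient region that has not been inhibited.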

From the initial biologically inspired models, many models based on mathematics, statistics or information theory arrived on the “saliency market”, predicting human attention better and better. They are all based on features extracted from the signal (most of the time low-level features, but not always), such as luminance, color, orientation, texture, motion, the relative position of objects, or even simply neighborhoods or patches from the signal. Once those features are extracted, all the existing methods are essentially based on the same principle: looking for “contrasted, rare, surprising, novel, worthy-to-learn, less compressible, maximizing the information” areas. All those words are actually synonyms, and they all amount to searching for some unusual features in a given context. This context can be local (typically center-surround spatial or temporal contrasts), global (the whole image or a very long temporal history), or it can be a model of normality (the image average, the image frequency content). Very recently, learning has become more and more involved in saliency computation: at first it was mainly about adjusting model coefficients for a precise task, whereas now complex classifiers such as deep neural networks are beginning to be used both to extract the features from the signal and to learn which features are the most salient, based on ground truth obtained with eye-tracking or mouse-tracking data.
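The shared principle of looking for rare features in a global context can be illustrated with a self-information score: the less frequent a feature value is in the whole signal, the higher its score. This is one simple instance of the principle, not a specific published model; the function name and the quantized labels are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import List


def rarity_saliency(features: List[int]) -> List[float]:
    """Score each item by its self-information -log2(p) within the whole set."""
    counts = Counter(features)
    total = len(features)
    return [-math.log2(counts[f] / total) for f in features]


# Eight quantized "color" labels: one red item (1) among blue ones (0).
labels = [0, 0, 0, 1, 0, 0, 0, 0]
scores = rarity_saliency(labels)
most_salient = scores.index(max(scores))
```

The single red item, which occurs with probability 1/8, gets a score of 3 bits, far above the score of the frequent blue items.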

So … what is attention?

The trans-domain nature of attention naturally led to a lot of different definitions. Attention deals with the allocation of cognitive resources to important incoming information in order to bring it to a conscious state, update a scene model, update memory and influence behavior. But several attention mechanisms were highlighted, especially following Cherry's cocktail party problem. A dichotomy appeared between divided attention and selective attention. From there, a clinical model of attention divided into five different “kinds” appeared. One can also talk about different kinds of attention depending on whether or not it needs the eye focus, or whether it uses only the image features or also memory and emotions… While its purpose seems to be the relation between the outer world and inner consciousness, memory and emotions, the clinical manifestation of attention tends to show that there might be several attentions.

Overt vs. covert: the eye

Overt versus covert attention is a property of attention which was identified at the very beginning of the psychological studies of attention. Overt attention is the one which can be exhibited by eye activity, or more generally by the focus of attention. Covert attention does not induce eye movements or a specific focus: it is the ability to catch (and thus be able to bring to consciousness) regions of an image which are not fixated by the eyes. Eye movements are due to the non-uniform distribution of receptive cells (cones and rods) on the retina. The cones, which provide high resolution and color, are mainly concentrated in the middle of the retina in a region called the “fovea”. This means that in order to acquire an area of an image with good spatial resolution, the eye must gaze towards this precise area to align it on the fovea. This constraint leads to three main types of eye movements:

Fixations: the gaze stays for a minimal time period on approximately the same spatial area. The eye gaze is never still: even when gazing at a specific location, micro-saccades can be detected. Micro-saccades are very small movements of the eye during the fixation of an area.
Saccades: the eyes make a ballistic movement between two fixations. They disengage from one fixation and are very rapidly shifted to the second fixation. Between the two fixations, no visual data is acquired.
Smooth pursuit: a smooth pursuit is a fixation … on a moving object. The eye follows a moving object to keep it in the fovea (the central part of the retina). During smooth pursuit, small and more abrupt corrections can be made to compensate for movement on the retina. This smooth pursuit with small corrections is called a nystagmus.
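The difference between fixations and saccades can be made concrete with a velocity-threshold classifier, a common way to segment recorded gaze data: slow samples are labeled as fixation and fast samples as saccade. The sampling rate, the threshold value and the gaze positions below are hypothetical, and the sketch ignores smooth pursuit and micro-saccades.

```python
import math
from typing import List, Tuple


def classify_gaze(samples: List[Tuple[float, float]], dt: float,
                  velocity_threshold: float) -> List[str]:
    """Label each inter-sample interval as 'fixation' or 'saccade' by speed."""
    labels = []
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        speed = math.hypot(x1 - x0, y1 - y0) / dt  # degrees per second
        labels.append("saccade" if speed > velocity_threshold else "fixation")
    return labels


# Hypothetical gaze positions in degrees of visual angle, sampled at 100 Hz.
gaze = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (5.0, 0.1), (5.1, 0.1)]
gaze_labels = classify_gaze(gaze, dt=0.01, velocity_threshold=100.0)
```

In this toy recording the gaze drifts slightly around one location, jumps about five degrees in a single sample (the saccade), and then settles on the new location.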

Modelling overt attention amounts to predicting human fixations and the dynamic path of the eye (called the eye “scanpath”).

Serial vs. parallel: the cognitive load

While focused, sustained and selective attention deal with serial processing of information, alternating and divided attention deal with parallel processing of several tasks. These facts show that attention can deal with information both serially and in parallel. While there is a limit to the number of tasks which can be processed in parallel during divided attention (around 5 tasks), massively parallel computation can be done in the case of pre-attentive processing. Some processes, such as the extraction of the gist [6], seem to be very fast and able to process (very roughly) the entire visual field to get a first idea of the context of the environment. The five kinds of attention follow a hierarchy based on the degree of focus, and thus on the cognitive load needed to achieve the attentive task. This approach is sometimes called the clinical model of attention.

Focused attention: respond to specific stimuli (focus on a precise task).
Sustained attention: keep a consistent response during longer continuous activity (stay attentive for a long period of time and follow the same topic).
Selective attention: selectively maintain the cognitive resource on specific stimuli (focus only on a given object while ignoring distractors).
Alternating attention: switch between multiple tasks (stop reading to watch something).
Divided attention: deal simultaneously with multiple tasks (talking while driving).

Bottom-up vs. top-down: the memory and actions

Another fundamental property of attention needs to be taken into account: attention is a mix of two components called the bottom-up (or exogenous) and top-down (or endogenous) components. The bottom-up component is reflex-based and uses the acquired signal. Attention is attracted by the novelty of some features in a given context (spatial local: there is a contrasted region; spatial global: there is a red dot while all the others are blue; temporal: there is a slow motion while the motion was fast before…). Its main purpose is to alert in case of unexpected or rare situations, and it is tightly related to survival. This first component of attention is the one which is best modeled in computer science, as the signal features are objective cues which can easily be extracted computationally.
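The spatially local case, a region that contrasts with its surround, can be sketched on a one-dimensional signal by comparing each sample with the mean of its neighbors. The window `radius` and the luminance values are assumptions of this toy example, which uses a single scale where real models use several.

```python
from typing import List


def center_surround(signal: List[float], radius: int = 2) -> List[float]:
    """Absolute difference between each sample and the mean of its surround."""
    contrast = []
    for i, center in enumerate(signal):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        surround = [signal[j] for j in range(lo, hi) if j != i]
        mean = sum(surround) / len(surround) if surround else center
        contrast.append(abs(center - mean))
    return contrast


# A row of uniform luminance with one bright sample in the middle.
luminance = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]
contrast = center_surround(luminance)
```

The bright sample gets the highest contrast, its immediate neighbors get a smaller response because the bright sample is part of their surround, and the far ends of the row get none.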

The second component of attention (top-down) deals with individual, subjective feelings. It is related to memory, emotions and individual goals. This component of attention is less easy to model computationally as it is more subjective and requires cues about individual goals, a priori knowledge or emotions. Top-down attention can itself be divided into two sub-components:

Goal/Action-related: depending on an individual's current goal, certain features or locations are inhibited and others receive more weight. The same individual with the same prior knowledge responds differently to the same stimuli when the task at hand is different. This component is sometimes called “volitional”.
Memory/Emotion-related: this process is related to experience and prior knowledge (and the emotions related to them). In this category one can find scene context (experience from previously viewed scenes with similar spatial layouts or similar motion behavior) or object recognition (you see your grandmother first in the middle of other people). This component of attention is more “automatic”: it does not need a heavy cognitive load and it can come in addition to volitional attention. Conversely, volitional top-down attention cannot inhibit memory-related attention, which will still work whether a goal is present or not. More generally, bottom-up attention cannot be inhibited if a strong and unusual signal is acquired. If someone is searching for his keys (volitional top-down), he will not pay attention to a car passing by. But if he hears a strange sound (bottom-up) and then recognizes a lion (memory-related top-down attention), he will stop searching for the keys and run away … Volitional top-down attention is able to inhibit the other components of attention only if the signals they carry are not very important.
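The interplay in the keys and lion example can be caricatured with a toy priority rule: the current goal boosts or inhibits items, unless a bottom-up signal is strong enough to override it. The function, the multiplicative weights, the `override_threshold` and all numeric values are hypothetical and only illustrate the qualitative behavior described above.

```python
from typing import Dict


def attended_item(bottom_up: Dict[str, float], goal_weights: Dict[str, float],
                  override_threshold: float = 0.8) -> str:
    """Return the item that wins attention under a toy priority rule."""
    # A sufficiently strong bottom-up signal cannot be inhibited by the goal.
    strongest = max(bottom_up, key=bottom_up.get)
    if bottom_up[strongest] >= override_threshold:
        return strongest
    # Otherwise the goal re-weights the items: boosts (>1) or inhibits (<1).
    priority = {k: v * goal_weights.get(k, 1.0) for k, v in bottom_up.items()}
    return max(priority, key=priority.get)


# Volitional goal: find the keys, ignore passing cars (made-up weights).
goal = {"keys": 3.0, "passing car": 0.2}
calm = attended_item({"keys": 0.3, "passing car": 0.5}, goal)
alarm = attended_item(
    {"keys": 0.3, "passing car": 0.5, "strange sound": 0.95}, goal
)
```

In the calm scene the goal lets the keys win over the more conspicuous car, while the strange sound exceeds the override threshold and captures attention regardless of the goal.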

Attention vs. attentions: a summary

The study of attention is an accumulation of disciplines ranging from philosophy to computer science, by way of psychology and neuroscience. Those disciplines sometimes study different aspects or views of attention, which means that giving a single and precise definition of attention is simply not feasible.

To sum up the different approaches, attention is about:

eye/neck mechanics and outside world information acquisition: the attentional “embodiment” leads to parallel and serial attention (overt versus covert attention)
allocation of cognitive resources to important incoming information: the attentional “filtering” is the first step towards data structuring (degree of focus and clinical model of attention)
mutual influence on memory and emotions: passing important information to a conscious state and getting feedback from memory and emotions (bottom-up and memory-related top-down attention)
behavior update: react to novel situations but also manage the goals and actions (bottom-up and volitional top-down attention)

Attention plays a crucial role from signal acquisition to action planning going through the main cognitive steps… or maybe there are simply several attentions and not only one. At this point this question still has no final answer.

References:

[1] Desimone, Robert, and John Duncan. “Neural mechanisms of selective visual attention.” Annual review of neuroscience 18.1 (1995): 193-222.

[2] Laberge (1999). Networks of Attention. In: Gazzaniga, Michael S., ed. The cognitive neurosciences. MIT press, 2004.

[3] Treisman, Anne M., and Garry Gelade. “A feature-integration theory of attention.” Cognitive psychology 12.1 (1980): 97-136.

[4] Koch, Christof, and Shimon Ullman. “Shifts in selective visual attention: towards the underlying neural circuitry.” Matters of intelligence. Springer Netherlands, 1987. 115-141.

[5] Itti, Laurent, Christof Koch, and Ernst Niebur. “A model of saliency-based visual attention for rapid scene analysis.” IEEE Transactions on pattern analysis and machine intelligence 20.11 (1998): 1254-1259.

[6] Torralba, Antonio, et al. “Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.” Psychological review 113.4 (2006): 766.