Attention in computer science – Part 2

In the previous part we mainly dealt with visibility models and static saliency models of attention. However, computational attention could not remain focused only on static images, and it has since developed in other modalities.

  1. Video saliency

Some still-image models were simply extended to video. For example, Seo & Milanfar (2009) introduced the time dimension by replacing square spatial patches with 3D spatio-temporal cubes whose third dimension is time. Itti’s model was also generalized by adding motion and flicker features to the initial spatial feature set of luminance, color and orientation. Those models mainly show that strong motion is salient, which raises a question: what can saliency models bring beyond a good motion detector? Models such as Mancas et al. (2011) compute a bottom-up saliency map that detects abnormal motion, with promising results ranging from a few moving objects to dense crowds (where performance actually increases). The idea is that motion is salient most of the time, but within motion some moving areas are more interesting than others.
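To make this concrete, the minimal sketch below scores each pixel by the rarity of its optical-flow vector within the frame, so that widespread motion is discounted and unusual motion stands out. It assumes OpenCV and NumPy; the bin counts, blur size and flow parameters are illustrative, and this is only a simplified illustration of the rarity idea, not the exact algorithm of Mancas et al. (2011).

```python
# Minimal sketch: motion-rarity saliency between two consecutive frames.
# Assumes OpenCV (cv2) and NumPy; all parameters are illustrative.
import cv2
import numpy as np

def motion_rarity_saliency(prev_gray, gray, n_mag_bins=16, n_ang_bins=16):
    # Dense optical flow (Farneback) between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    # Quantize magnitude and direction, then measure how frequent each
    # (magnitude, direction) pair is over the whole frame.
    mag_q = np.minimum((mag / (mag.max() + 1e-6) * n_mag_bins).astype(int),
                       n_mag_bins - 1)
    ang_q = (ang / (2 * np.pi) * n_ang_bins).astype(int) % n_ang_bins
    hist = np.zeros((n_mag_bins, n_ang_bins))
    np.add.at(hist, (mag_q, ang_q), 1)
    prob = hist / (hist.sum() + 1e-9)

    # Rare motion patterns (low probability) get high saliency (self-information).
    saliency = -np.log(prob[mag_q, ang_q] + 1e-9)
    return cv2.GaussianBlur(saliency.astype(np.float32), (11, 11), 0)
```

With such a map, a crowd moving coherently produces low saliency everywhere, while a single person moving against the flow produces a localized peak.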

  2. 3D saliency

3D saliency modeling is an emerging area of research that was boosted by two facts. First, the arrival of affordable RGB-D cameras, which provide both classical RGB images and a depth map describing each pixel’s distance from the camera; this depth information is very valuable and provides new features (curvature, compactness, convexity, …). Second, the higher availability of 3D models (used, for example, in 3D printing); libraries like PCL (Aldoma et al. (2012)) can handle 3D point clouds, convert formats and compute features from those point clouds. As for video, most 3D saliency models are extensions of still-image models. Some apply Itti’s approach to 3D meshes, others simply add depth as an additional feature, while recent models operate directly on point clouds. Since 3D saliency models are mainly extensions of 2D models, the different features can be taken into account locally and/or globally on the 3D objects, depending on the model being extended.
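The sketch below illustrates one of the simplest extension strategies mentioned above: treating depth as an additional conspicuity channel that is fused with an existing 2D saliency map. It assumes only NumPy and SciPy; the center/surround scales and fusion weights are illustrative assumptions, not values from a specific published model.

```python
# Minimal sketch: depth as an additional conspicuity channel for an RGB-D frame.
# Assumes NumPy/SciPy; the scales and fusion weights are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def depth_conspicuity(depth, center_sigma=2.0, surround_sigma=16.0):
    """Center-surround contrast computed directly on the depth map."""
    center = gaussian_filter(depth, center_sigma)
    surround = gaussian_filter(depth, surround_sigma)
    return np.abs(center - surround)

def fuse_with_depth(saliency_2d, depth, w_2d=0.7, w_depth=0.3):
    """Linear fusion of a 2D saliency map with the depth channel."""
    d = depth_conspicuity(depth)
    d = d / (d.max() + 1e-9)
    s = saliency_2d / (saliency_2d.max() + 1e-9)
    return w_2d * s + w_depth * d
```

Point-cloud-based models replace the depth map with per-point features (normals, curvature, compactness) computed with tools such as PCL, but the fusion logic remains similar.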

  3. Audio saliency

There are very few auditory attention models compared to visual attention models. One approach deals with the local context of audio signals: Kayser et al. (2005) compute auditory saliency maps based on Itti’s visual model (1998). First, the sound wave is converted to a time-frequency representation. Then three auditory features are extracted at different scales and in parallel (intensity, frequency contrast, and temporal contrast). For each feature, the maps obtained at different scales are compared using a center-surround mechanism and normalized. Finally, a linear combination builds the saliency map, which is then reduced to one dimension so that it fits the one-dimensional audio signal.

Another approach to computing auditory saliency maps follows the well-established Bayesian surprise framework from computer vision (Itti & Baldi (2006)). An auditory surprise is introduced to detect acoustically salient events: first, a Short-Time Fourier Transform (STFT) is used to compute the spectrogram, and the surprise is then computed in the Bayesian framework.
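The following minimal sketch follows the pipeline described for the first approach (spectrogram, three feature maps, multi-scale center-surround, normalization, linear combination, collapse to one dimension). It is an illustrative reimplementation, not Kayser et al.’s exact code; the STFT parameters, scales and feature definitions are assumptions.

```python
# Minimal sketch of a Kayser-style auditory saliency curve (illustrative parameters).
import numpy as np
from scipy.signal import stft
from scipy.ndimage import gaussian_filter

def auditory_saliency(x, fs, scales=(2, 4, 8)):
    # 1) Time-frequency representation (log-magnitude spectrogram).
    _, _, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
    spec = np.log1p(np.abs(Z))

    # 2) Three auditory features: intensity, frequency contrast, temporal contrast.
    intensity = spec
    freq_contrast = np.abs(np.gradient(spec, axis=0))   # contrast along frequency
    temp_contrast = np.abs(np.gradient(spec, axis=1))   # contrast along time

    saliency = np.zeros_like(spec)
    for feat in (intensity, freq_contrast, temp_contrast):
        for s in scales:
            # 3) Center-surround: fine (center) scale minus coarse (surround) scale.
            cs = np.abs(gaussian_filter(feat, s) - gaussian_filter(feat, 4 * s))
            # 4) Normalize each map before the linear combination.
            saliency += cs / (cs.max() + 1e-9)

    # 5) Collapse over frequency to obtain a one-dimensional saliency curve
    #    aligned with the time axis of the audio signal.
    return saliency.sum(axis=0)
```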

  4. Top-down saliency

Top-down information is endogenous and comes from the inner world: information stored in memory, its related emotional level, and task-related information. In practice, two main families of top-down information can be added to bottom-up attention models.

4.1       What is normal?

The first family mainly deals with learned normality, which can come from experience with the current signal if it is time-varying, or from previous experience (tests, databases). For still images, the “normal” gaze behavior can be learned from the “mean observer”: eye tracking (Judd et al. (2009)) or mouse tracking (Mancas (2007)) can be applied to several users, and their average gaze over a set of natural images can be computed. The results show that, for natural images, eye gaze is attracted to the center of the images. This observation is very different for more specific images that rely on a priori knowledge: Mancas (2009) showed, using mouse tracking, that gaze density is very different on a set of advertisements and on a set of websites, partly because of the a priori knowledge people have about those images (areas containing the title, logo or menu).

For video signals, it is also possible to accumulate motion patterns over time for each extracted feature to obtain a model of normality. After a given observation period, the model can detect that at a given location moving objects are generally fast and going from left to right. If an object at the same location is slow and/or going from right to left, this is surprising given what was previously learned from the scene, so attention will be directed to this object (see the sketch below).

For 3D signals, another cue is the proximity of objects: a close object is more likely to attract attention, as it is more likely to be the first one we will have to interact with.
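As a sketch of the video normality model mentioned above, the class below accumulates per-location velocity statistics online and scores how surprising a new observation is. The grid size, the Welford-style running statistics and the surprise score are illustrative assumptions, not a specific published formulation.

```python
# Minimal sketch: per-location model of "normal" motion, updated online.
# A new observation is surprising when it deviates from the learned statistics.
import numpy as np

class MotionNormalityModel:
    def __init__(self, grid_shape=(30, 40)):
        # Running mean/variance of the 2D velocity at each grid cell.
        self.count = np.zeros(grid_shape)
        self.mean = np.zeros(grid_shape + (2,))
        self.m2 = np.zeros(grid_shape + (2,))

    def update(self, cell, velocity):
        """Accumulate one velocity observation (vx, vy) at a grid cell."""
        self.count[cell] += 1
        delta = velocity - self.mean[cell]
        self.mean[cell] += delta / self.count[cell]
        self.m2[cell] += delta * (velocity - self.mean[cell])

    def surprise(self, cell, velocity):
        """High when motion at this location differs from what was learned
        (e.g. wrong direction or unusual speed)."""
        var = self.m2[cell] / np.maximum(self.count[cell] - 1, 1) + 1e-6
        z = (velocity - self.mean[cell]) ** 2 / var
        return z.sum()
```

After an observation period during which `update` is called for every tracked object, `surprise` returns a low value for objects moving like the majority did before, and a high value for an object that is slow or moving against the learned direction.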

4.2       Where are my keys?

While the previous section dealt with attention attracted by events that are not consistent with the knowledge acquired about the scene, here we focus on a second main top-down cue: a visual task (“Find the keys!”). Such a task has a huge influence on the way an image is attended, and it implies both object recognition (“recognize the keys”) and knowledge of the object’s usual location (“they could be on the floor, but never on the ceiling”).

Object recognition can be achieved through classical methods or using points of interest (like SIFT, SURF, … Bay et al. (2008)). Some authors integrated the notion of object recognition into the architecture of their model, like Navalpakkam & Itti (2005): they extract the same features as the bottom-up model from the target object and learn them. This learning step modifies the weights used when fusing the conspicuity maps, which leads to the detection of areas containing the same feature combination as the learned object. Another way of adding top-down information is to give a higher weight to the areas of the image that have a higher probability of containing the searched object; several authors, such as Oliva et al. (2003), developed methods to learn objects’ likely locations.
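The sketch below combines the two ideas of this section: feature weights learned from the target object bias the fusion of the conspicuity maps, and an optional spatial prior favors locations where the object is likely to be found. The function, its names and the default weights are illustrative assumptions rather than the exact mechanism of Navalpakkam & Itti (2005) or Oliva et al. (2003).

```python
# Minimal sketch: task-driven (top-down) modulation of bottom-up conspicuity maps.
import numpy as np

def top_down_saliency(conspicuity_maps, target_feature_weights, location_prior=None):
    """conspicuity_maps: dict feature_name -> 2D map (e.g. 'color', 'orientation').
    target_feature_weights: dict feature_name -> weight learned from the target
    (features that characterize the searched object get larger weights).
    location_prior: optional 2D map of likely object locations
    (e.g. keys on floors and tables, never on the ceiling)."""
    h, w = next(iter(conspicuity_maps.values())).shape
    saliency = np.zeros((h, w))
    for name, cmap in conspicuity_maps.items():
        weight = target_feature_weights.get(name, 1.0)
        saliency += weight * cmap / (cmap.max() + 1e-9)
    if location_prior is not None:
        saliency *= location_prior / (location_prior.max() + 1e-9)
    return saliency
```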

  5. Learning bottom-up and top-down together

Recently, learning the salient features has become more and more popular: the idea is not to find the rare regions, but to find an optimal description of rare regions that are already known from eye-tracking or mouse-tracking ground truth. The learning is based on deep neural networks, sparse coding and pooling applied to large image datasets where the regions of interest are known. The most attended regions, according to the eye-tracking results, are used to train classifiers that extract the main features of those areas. The use of deep neural networks greatly improved these techniques, which are now able to extract meaningful mid- and high-level features that best describe the salient regions (Shen & Zhao (2014)). Indeed, this learning step rediscovers the classical bottom-up features in the first layers, but it also adds context, a centered Gaussian prior, and object detection (faces, text) and recognition. An issue with these methods is a loss of generality: the models work well on the datasets they were trained on, even if deep learning is able to cope with high variability in the case of general images, for example.
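As a minimal sketch of this family of approaches, the small fully convolutional network below is trained to regress fixation density maps from images. It is not the architecture of Shen & Zhao (2014) nor any specific published model; the layer sizes, loss and training step are illustrative assumptions (PyTorch is assumed, and fixation maps are assumed normalized to [0, 1]).

```python
# Minimal sketch: learning saliency end-to-end from fixation maps (PyTorch).
import torch
import torch.nn as nn

class SaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                   # low-level feature layers
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),  # mid-level features
        )
        self.readout = nn.Conv2d(64, 1, 1)               # 1x1 conv -> saliency map

    def forward(self, x):
        s = self.readout(self.features(x))
        # Upsample back to the input resolution and squash to [0, 1].
        s = nn.functional.interpolate(s, size=x.shape[-2:],
                                      mode='bilinear', align_corners=False)
        return torch.sigmoid(s)

def train_step(model, optimizer, images, fixation_maps):
    """One training step: images and eye-tracking fixation density maps
    are the (input, target) pairs; BCE is one common choice of loss."""
    optimizer.zero_grad()
    loss = nn.functional.binary_cross_entropy(model(images), fixation_maps)
    loss.backward()
    optimizer.step()
    return loss.item()
```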

  6. Attention in computer science

In computer science there are two families of models: some are based on feature visibility and others on the concept of saliency maps, the latter approach being the most prolific. For saliency-based bottom-up attention the idea is the same for all models: find the areas of the image that are the most surprising in a given context (local, global or normality-based). Finally, a set of top-down cues that can influence saliency-based models was reviewed. Recently, deep neural networks have been used to integrate bottom-up and top-down information at the same time.

  7. References

Aldoma, A., Marton, Z.-C., Tombari, F., Wohlkinger, W., Potthast, C., Zeisl, B., Rusu, R. B., Gedikli, S. & Vincze, M. (2012). Point cloud library, IEEE Robotics & Automation Magazine 19(3).

Bay, H., Ess, A., Tuytelaars, T. & Van Gool, L. (2008). SURF: Speeded up robust features, Computer Vision and Image Understanding (CVIU) 110(3): 346–359.

Itti, L. & Baldi, P. F. (2006). Modeling what attracts human gaze over dynamic natural scenes, in L. Harris & M. Jenkin (eds), Computational Vision in Neural and Machine Systems, Cambridge University Press, Cambridge, MA.

Judd, T., Ehinger, K., Durand, F. & Torralba, A. (2009). Learning to predict where humans look, IEEE Inter. Conf. on Computer Vision (ICCV), pp. 2376–2383.

Kayser, C., Petkov, C., Lippert, M. & Logothetis, N. K. (2005). Mechanisms for allocating auditory attention: An auditory saliency map, Curr. Biol. 15: 1943–1947.

Mancas, M. (2007). Computational Attention Towards Attentive Computers, Presses universitaires de Louvain.

Mancas, M. (2009). Relative influence of bottom-up and top-down attention, Attention in Cognitive Systems, Vol. 5395 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg.

Mancas, M., Riche, N., Leroy, J. & Gosselin, B. (2011). Abnormal motion selection in crowds using bottom-up saliency, IEEE ICIP.

Navalpakkam, V. & Itti, L. (2005). Modeling the influence of task on attention, Vision Research 45(2): 205–231.

Oliva, A., Torralba, A., Castelhano, M. & Henderson, J. (2003). Top-down control of visual attention in object detection, Proceedings of the International Conference on Image Processing (ICIP 2003), Vol. 1, pp. 253–256.

Seo, H. J. & Milanfar, P. (2009). Static and space-time visual saliency detection by self-resemblance, Journal of Vision 9(12). URL: http://www.journalofvision.org/content/9/12/15.abstract

Shen, C. & Zhao, Q. (2014). Learning to predict eye fixations for semantic contents using multi-layer sparse network, Neurocomputing 138: 61–68.