Trends, Applications, and Challenges in Human Attention Modelling
Giuseppe Cartella, Marcella Cornia, Vittorio Cuculo, Alessandro D'Amelio, Dario Zanca, Giuseppe Boccignone, Rita Cucchiara
TL;DR
The paper surveys how human attention modelling, in the form of saliency maps and scanpaths, can guide deep learning across image/video processing, vision-and-language tasks, and language modelling. It provides a taxonomy of modelling approaches and applications, complemented by benchmarks, datasets, and evaluation metrics (e.g., AUC, NSS, string-edit distance). Key themes include integrating gaze data to improve object recognition, captioning, VQA, and reading comprehension, as well as domain-specific applications in robotics, autonomous driving, and medicine. The paper also discusses open challenges such as data scarcity, privacy concerns, and the need for synthetic gaze data and real-time multimodal gaze modelling, and it recommends privacy-aware data collection and scalable gaze generation to advance human-AI interaction.
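To make two of the metrics named above concrete, the following is a minimal sketch, not the survey's own implementation. It assumes NSS is computed as the mean z-scored saliency value at fixated pixels (using the population standard deviation; some toolboxes use the sample standard deviation instead), and that scanpaths are compared as strings of region labels with the Levenshtein distance. The function names `nss` and `edit_distance` and the input layouts are illustrative choices.

```python
from statistics import fmean, pstdev


def nss(saliency, fixations):
    """Normalized Scanpath Saliency (sketch).

    saliency:  2D list of floats, one inner list per image row.
    fixations: list of (row, col) pixel coordinates of human fixations.
    Assumption: normalization uses the population standard deviation.
    """
    values = [v for row in saliency for v in row]
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    # Average the z-scored saliency over the fixated locations.
    return fmean([(saliency[r][c] - mu) / sigma for r, c in fixations])


def edit_distance(a, b):
    """Levenshtein distance between two scanpath strings (sketch).

    Each character is assumed to label the region fixated at that step.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

A higher NSS means the predicted saliency map assigns above-average values to the locations people actually looked at, while a lower edit distance means two scanpaths visit regions in a more similar order.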
Abstract
Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for supporting artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges. For a comprehensive overview of ongoing research, refer to our dedicated repository, available at https://github.com/aimagelab/awesome-human-visual-attention.
