Trends, Applications, and Challenges in Human Attention Modelling

Giuseppe Cartella, Marcella Cornia, Vittorio Cuculo, Alessandro D'Amelio, Dario Zanca, Giuseppe Boccignone, Rita Cucchiara

TL;DR

The paper surveys how human attention modelling, via saliency maps and scanpaths, can guide deep learning across image/video processing, vision-and-language tasks, and language modelling. It provides a taxonomy of modelling approaches and applications, complemented by benchmarks, datasets, and evaluation metrics (e.g., AUC, NSS, string-edit distance). Key themes include integrating gaze data to improve object recognition, captioning, VQA, and reading comprehension, as well as domain-specific applications in robotics, autonomous driving, and medicine. The paper also discusses open challenges such as data scarcity, privacy concerns, and the need for synthetic gaze data and real-time multimodal gaze modelling, and it recommends privacy-aware data collection and scalable gaze generation to advance human-AI interaction.
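Two of the metrics named above can be illustrated with a minimal sketch. It assumes the commonly used definitions: NSS (Normalized Scanpath Saliency) as the mean of the z-scored saliency map at fixated pixels, and string-edit distance as the Levenshtein distance between scanpaths encoded as strings of region labels. The function names `nss` and `string_edit_distance`, the binary fixation-map input, and the label encoding are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def nss(saliency, fixations):
    """Mean z-scored saliency at fixated pixels (assumed NSS definition).

    saliency:  2D array of predicted saliency values.
    fixations: 2D array of the same shape, non-zero where a fixation landed.
    """
    s = np.asarray(saliency, dtype=float)
    f = np.asarray(fixations, dtype=bool)
    std = s.std()
    # Undefined for a constant map or when there are no fixations.
    if std == 0 or not f.any():
        return float("nan")
    z = (s - s.mean()) / std
    return float(z[f].mean())


def string_edit_distance(a, b):
    """Levenshtein distance between two scanpaths given as label sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    sal = [[0.0, 0.0], [0.0, 1.0]]
    fix = [[0, 0], [0, 1]]
    print(nss(sal, fix))
    # Scanpaths over hypothetical regions of interest labelled A to D.
    print(string_edit_distance("ABCD", "ABDD"))
```

Under these assumptions, a higher NSS means the predicted map assigns above-average saliency to the locations people actually fixated, while a lower edit distance means two scanpaths visit regions in a more similar order.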

Abstract

Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges. For a comprehensive overview of ongoing research, refer to our dedicated repository, available at https://github.com/aimagelab/awesome-human-visual-attention.

Paper Structure (10 sections, 1 figure, 1 table)

This paper contains 10 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: An overview of sample architectures integrating human visual attention with different input and output modalities. Human visual attention has been employed to solve tasks in diverse domains, spanning image and video processing, automatic captioning, visual question answering, and language understanding, as well as robotics, autonomous driving, and medicine.