Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification

Jiahang Li, Shibo Xue, Yong Su

TL;DR

This work tackles shortcut bias in visual classification by incorporating human gaze as a supervisory cue. It introduces Gaze-CIFAR-10, a high-resolution dataset with time-series gaze data, and a Dual-Sequence Gaze Encoder that captures the temporal and spatial structure of attention, which is fused with a Vision Transformer to correct mislocalized features. Through extensive experiments and ablations, the approach improves robustness and accuracy across backbones, with notable gains (e.g., up to $85.90\%$) and insights that removing cross-attention can be beneficial in this sparse-gaze setting. The study demonstrates that human gaze priors can guide deep models toward more human-aligned, discriminative localization, with potential impact for data-scarce domains such as medical imaging and few-shot learning.

Abstract

Inspired by human visual attention, deep neural networks have widely adopted attention mechanisms to learn locally discriminative attributes for challenging visual classification tasks. However, existing approaches primarily emphasize the representation of such features while neglecting their precise localization, which often leads to misclassification caused by shortcut biases. This limitation becomes even more pronounced when models are evaluated on transfer or out-of-distribution datasets. In contrast, humans are capable of leveraging prior object knowledge to quickly localize and compare fine-grained attributes, a capability that is especially crucial in complex and high-variance classification scenarios. Motivated by this, we introduce Gaze-CIFAR-10, a human gaze time-series dataset, along with a dual-sequence gaze encoder that models the precise sequential localization of human attention on distinct local attributes. In parallel, a Vision Transformer (ViT) is employed to learn the sequential representation of image content. Through cross-modal fusion, our framework integrates human gaze priors with machine-derived visual sequences, effectively correcting inaccurate localization in image feature representations. Extensive qualitative and quantitative experiments demonstrate that gaze-guided cognitive cues significantly enhance classification accuracy.
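One way to picture how a gaze time series can be lined up with a ViT's patch sequence is to map each gaze sample to the index of the image patch it falls on. The sketch below is illustrative only and is not the authors' dual-sequence gaze encoder. The function name `gaze_to_patch_sequence`, the 224-pixel image size, the 16-pixel patch size, and the choice to merge consecutive samples on the same patch are all assumptions made for this example.

```python
from typing import List, Tuple


def gaze_to_patch_sequence(
    gaze_points: List[Tuple[float, float]],
    image_size: int = 224,   # assumed square input resolution
    patch_size: int = 16,    # assumed ViT patch size
) -> List[int]:
    """Map time-ordered gaze points (x, y), in pixels, to row-major patch indices."""
    grid = image_size // patch_size  # number of patches per side
    sequence: List[int] = []
    for x, y in gaze_points:
        # Clamp so that samples outside the image land on the nearest border patch.
        col = min(max(int(x // patch_size), 0), grid - 1)
        row = min(max(int(y // patch_size), 0), grid - 1)
        idx = row * grid + col
        # Merge consecutive samples that stay on the same patch (an assumption).
        if not sequence or sequence[-1] != idx:
            sequence.append(idx)
    return sequence


# Hypothetical trajectory: two samples on the top-left patch, then two jumps.
trajectory = [(10.0, 10.0), (12.0, 14.0), (100.0, 40.0), (223.0, 223.0)]
patch_sequence = gaze_to_patch_sequence(trajectory)
print(patch_sequence)
```

The resulting index sequence keeps the temporal order of fixations while expressing them in the same coordinate system as the ViT tokens, which is the kind of alignment a cross-modal fusion step would need.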

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A toy example illustrating shortcut bias: (a) DNN attention versus (b) human gaze under limited data scale and diversity.
  • Figure 2: Gaze data collection setup. (a) Overview of our data acquisition system. (b) Step 1: Reconstruct image resolution. Step 2: Participants freely view two randomly selected images from different categories. Step 3: One image is randomly re-sampled from the previously viewed categories and shown again for focused observation. Step 4: Gaze data is transmitted to the PC for processing.
  • Figure 3: Gaze-guided cross-modal fusion network.
  • Figure 4: Training loss and test accuracy comparison between the proposed method and fine-tuned ViT.
  • Figure 5: Comparison between ViT attention maps that lead to misclassification and human gaze points that guide correct classification. The red dot indicates the starting point of the gaze trajectory, while the green dot marks the end point.