Gaze-Informed Vision Transformers: Predicting Driving Decisions Under Uncertainty

Sharath Koorathota; Nikolas Papadopoulos; Jia Li Ma; Shruti Kumar; Xiaoxiao Sun; Arunesh Mittal; Patrick Adelman; Paul Sajda

TL;DR

The paper addresses the prediction of driving decisions under visual uncertainty and proposes gaze-informed training of Vision Transformers (ViTs) with a Fixation-Attention Intersection (FAX) loss. Aligning ViT attention with human gaze during training encourages the model to focus on gaze-relevant regions without sacrificing its broad perceptual field. The loss uses the intersection measure $\mathcal{I}$ and the term $\mathcal{L}_{INT}$ within $\mathcal{L}_{FAX} = (1-\lambda)\mathcal{L}_{BCE} + \lambda\mathcal{L}_{INT}$, where $\lambda$ sets the weight given to gaze. Empirical results on a VR dataset and the real-world DR(eye)VE dataset show that FAX-trained ViTs achieve higher accuracy under high uncertainty and align their attention more closely with human gaze, with notable improvements on DR(eye)VE (e.g., a $\sim$7.5% gain for 12-FAX over 12-ViT) and an optimal gaze weighting that depends on the dataset. The work also shows that pruning layers based on their similarity to human gaze can retain performance with fewer layers, suggesting efficient gaze-guided Transformers for human-centered driving analysis and potentially broader AI systems. The findings are relevant to driver behavior analysis and to gaze-informed AI in complex visual environments, and they point to future work on temporal modeling and cross-domain gaze-guided learning.
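The weighted combination in $\mathcal{L}_{FAX}$ can be illustrated with a minimal sketch. The summary above does not define $\mathcal{I}$ or $\mathcal{L}_{INT}$ in detail, so the forms used here are assumptions: `intersection` is taken to be an L2-normalized dot product between a flattened attention map and a flattened fixation map, and the intersection loss is taken to be `1 - intersection`. The function names are hypothetical and are not taken from the paper.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def intersection(attention, fixation):
    """ASSUMED form of I: dot product of the two flattened maps, L2-normalized."""
    dot = sum(a * f for a, f in zip(attention, fixation))
    norm_a = math.sqrt(sum(a * a for a in attention))
    norm_f = math.sqrt(sum(f * f for f in fixation))
    if norm_a == 0.0 or norm_f == 0.0:
        return 0.0  # an empty map overlaps with nothing
    return dot / (norm_a * norm_f)


def fax_loss(p, y, attention, fixation, lam):
    """L_FAX = (1 - lam) * L_BCE + lam * L_INT, with the ASSUMED L_INT = 1 - I."""
    l_int = 1.0 - intersection(attention, fixation)
    return (1.0 - lam) * bce(p, y) + lam * l_int


# Toy example: a 3-cell attention map that partly overlaps the fixation map.
attention = [0.1, 0.7, 0.2]
fixation = [0.0, 1.0, 0.0]
loss = fax_loss(p=0.8, y=1, attention=attention, fixation=fixation, lam=0.2)
```

With `lam = 0` the sketch reduces to plain cross-entropy, and with `lam = 1` only the gaze-alignment term remains, which mirrors the role of $\lambda$ as the gaze weighting described above.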

Abstract

Vision Transformers (ViT) have advanced computer vision, yet their efficacy in complex tasks like driving remains less explored. This study enhances ViT by integrating human eye gaze, captured via eye-tracking, to increase prediction accuracy for driving under uncertainty in both real-world and virtual reality settings. First, we establish the significance of human eye gaze in left-right driving decisions, as observed in both human subjects and a ViT model. By comparing the similarity between human fixation maps and ViT attention weights, we reveal the dynamics of overlap across individual heads and layers. This overlap demonstrates that fixation data can guide the model in distributing its attention weights more effectively. We introduce the fixation-attention intersection (FAX) loss, a novel loss function that significantly improves ViT performance under high uncertainty conditions. Our results show that ViT, when trained with FAX loss, aligns its attention with human gaze patterns. This gaze-informed approach has significant potential for driver behavior analysis, as well as broader applications in human-centered AI systems, extending ViT's use to complex visual environments.
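The comparison between fixation maps and attention weights across heads and layers can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: attention is assumed to be available as a nested list `attn[layer][head]` of flattened maps with the same length as the flattened fixation map, the similarity is a raw dot product, and all function names are hypothetical.

```python
def dot_similarity(attention, fixation):
    """Raw dot product between a flattened attention map and a flattened fixation map."""
    return sum(a * f for a, f in zip(attention, fixation))


def similarity_by_layer_head(attn, fixation):
    """Return sim[layer][head] for every head in every layer."""
    return [[dot_similarity(head, fixation) for head in layer] for layer in attn]


def mean_per_layer(sim):
    """Average the head-level similarities within each layer."""
    return [sum(row) / len(row) for row in sim]


# Toy example: 2 layers, 2 heads, 4 image patches, gaze on the second patch.
fixation = [0.0, 1.0, 0.0, 0.0]
attn = [
    [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]],  # layer 0: diffuse or off-target
    [[0.1, 0.8, 0.05, 0.05], [0.2, 0.6, 0.1, 0.1]],    # layer 1: concentrated on the gaze patch
]
sim = similarity_by_layer_head(attn, fixation)
layer_means = mean_per_layer(sim)
```

In this toy example the second layer scores higher than the first, the kind of per-layer difference that a similarity-based ranking of layers would rely on.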

Paper Structure (26 sections, 5 equations, 6 figures, 5 tables)

Introduction
Related work
Proposed Methods
Baseline Vision Transformer
Fixation Maps
Fixation-Attention Intersection (FAX) Loss
Peripheral Masking of the Input
Datasets
VR Dataset
DR(eye)VE Dataset
Uncertainty in Visual Scene
Results
Comparing Human and Model Attention Under Uncertainty
Layer Pruning in Vision Transformers Based on Similarity to Human Attention
Assessing the Impact of Human Eye Gaze on Task Performance
...and 11 more sections

Figures (6)

Figure 1: Peripheral masking of the input.
Figure 2: (A) KDE plot illustrating the distribution of fixations across pixel coordinates (x and y) across all test sample frames in the VR and DR(eye)VE datasets. Fixations are extracted from and aggregated over the premotor period prior to motor decisions. Higher density distribution indicates higher fixation duration. Class-specific (left or right) distributions are denoted in red; the overall distribution is gray. (B and C) Qualitative ViT results from two test samples corresponding to low (B) and high (C) uncertainty conditions in the VR dataset. X = dot product similarity between fixation and respective activation map. Only weights from 3 heads across 3 layers, corresponding to the first, middle, and last layers, respectively, are shown.
Figure 3: (A, B) Total, standardized sum of activations, by uncertainty split, for both datasets. We define total activation in the baseline ViT as the sum of attention weights across layers and heads. Total fixation refers to the pixel-wise sum of fixation maps, a measure of the overall fixation area. Total edge activation refers to the pixel-wise sum of edge maps. (C, D) The similarity between attention weights across layers and fixation maps, using Eq. \ref{dot_product_intersection}. Results are aggregated from all test samples on the best-performing, 12-layer baseline ViT. Line color shows the uncertainty split of the test samples, while line style shows whether ViT classified the motor action correctly. Error band shows the 95% CI.
Figure 4: Boxplots displaying the test accuracy under high uncertainty of the top-performing models on the DR(eye)VE and VR datasets. 12-ViT and 5-ViT denote Vision Transformer models with 12 and 5 layers, respectively; 12-FAX and 5-FAX represent equivalent ViT models trained with the FAX loss, with the optimal $\lambda$ value in each case. The Mann-Whitney U test assesses statistical significance (* p < 0.05, ** p < 0.01, *** p < 0.001).
Figure 5: Training with FAX loss aligns model attention with human gaze. This figure shows the impact of FAX loss on the ViT model's attention maps for test set samples. (A) Input frame with overlaid human fixation map. (B) Human fixation data aligned to ViT attention map dimensions. (C) Average attention maps across all heads for each ViT layer, for distinct $\lambda$ values in FAX loss (Eq. \ref{custom_loss}), showing increasing resemblance to human fixation patterns with higher $\lambda$ values. (D) Intersection over Union (IoU) metric between attention and fixation maps for the test set samples for all $\lambda$ values, quantifying alignment. These results demonstrate that optimal $\lambda$ values in FAX loss (e.g., $\lambda = 0.2, 0.8$ for DR(eye)VE and $\lambda = 0.1, 0.2$ for VR) lead to attention maps better resembling human fixation area, indicating the model's ability to predict human gaze in driving scenarios.
...and 1 more figure
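The Intersection over Union (IoU) metric named in the Figure 5 caption can be sketched for a pair of maps as below. The caption does not say how the continuous attention and fixation maps are turned into regions, so the thresholding here (keep cells at or above a fraction `frac` of each map's peak) is an assumption, and the function names are hypothetical.

```python
def binarize(values, frac=0.5):
    """ASSUMED binarization: mark cells at or above frac times the map's peak value."""
    peak = max(values)
    if peak <= 0:
        return [False] * len(values)  # an all-zero map has no active region
    return [v >= frac * peak for v in values]


def iou(attention, fixation, frac=0.5):
    """Intersection over Union between the binarized attention and fixation regions."""
    a = binarize(attention, frac)
    f = binarize(fixation, frac)
    inter = sum(1 for x, y in zip(a, f) if x and y)
    union = sum(1 for x, y in zip(a, f) if x or y)
    return inter / union if union else 0.0


# Toy example on four patches: the regions share one patch out of three active ones.
attention = [0.9, 0.8, 0.1, 0.0]
fixation = [1.0, 0.2, 0.7, 0.0]
score = iou(attention, fixation)
```

A value of 1.0 means the two thresholded regions coincide and 0.0 means they do not overlap; the threshold choice changes the score, so it would need to match whatever the paper actually uses before any numbers are compared.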
