FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling

Kaiser Hamid, Can Cui, Khandakar Ashrafi Akbar, Ziran Wang, Nade Liang

TL;DR

FSDAM tackles the problem of driver attention modeling under data scarcity by proposing a few-shot, dual-pathway framework that jointly predicts a spatial gaze distribution and generates attention-grounded captions. It leverages a frozen vision-language backbone with lightweight adapters and introduces a vision-language alignment objective to tether gaze to semantic content, enabling effective learning from around 90–100 annotated examples. The method achieves competitive gaze metrics and caption quality across multiple driving benchmarks, with strong zero-shot transfer and robust performance in data-constrained settings. This work demonstrates the feasibility of explainable driver attention systems in practical, data-limited deployments and points to future work in temporal modeling and distributed attention handling.

Abstract

Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture in which separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive performance on attention prediction and generates coherent, context-aware explanations. The model demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Few-shot learning performance on the BDD-A dataset. Gaze prediction metrics: CC, SIM, NSS, AUC-J, AUC-B, and KL (inverted). All metrics are linearly min-max normalized to [0,1], so a larger area indicates better performance.
  • Figure 2: Dataset curation pipeline. From BDD-A [10.1007/978-3-030-20873-8_42] videos to paired frames, GPT-4o captioning with fixed template, human verification, and final captions. Full details in Appendix.
  • Figure 3: FSDAM architecture for joint gaze prediction and caption generation. A frozen vision-language backbone extracts spatial features $F_t$ and text embeddings $q_{\text{text}}$. The gaze pathway, left, upsamples $F_t$ to predict attention and gaze maps $\widehat{G}$. The caption pathway, right, uses cross-attention over $F_t$ to generate structured explanations.
  • Figure 4: Qualitative comparison of driver attention prediction on BDD-A test scenes showing the input image and attention heatmaps from ground truth, FSDAM trained on 90 samples, U2-Net [Qin_2020_PR], and DeepLabV3 [chen2017rethinkingatrousconvolutionsemantic].
  • Figure 5: Driving scene and model output showing the input frame, its ground truth gaze map, and the corresponding FSDAM prediction. The model produces attention heatmaps over key regions and generates structured attention reasoning.
  • ...and 1 more figure
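
The Figure 1 caption describes linear min-max normalization of each gaze metric to [0,1], with KL inverted so that larger always means better. A minimal sketch of that rescaling follows. It assumes "inverted" means taking one minus the normalized value, which the caption does not spell out. The function name `minmax_normalize` and the example scores are hypothetical and are not results from the paper.

```python
def minmax_normalize(values, invert=False):
    """Linearly rescale scores to [0, 1].

    With invert=True the scale is flipped, so the lowest raw value maps to 1.
    This is one plausible reading of "KL (inverted)"; it is an assumption.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores equal: the range is zero, so return zeros (arbitrary choice).
        return [0.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


# Made-up scores for three hypothetical models (not taken from the paper).
cc_scores = [0.25, 0.5, 0.75]   # correlation coefficient: higher is better
kl_scores = [2.0, 1.5, 1.0]     # KL divergence: lower is better

cc_norm = minmax_normalize(cc_scores)
kl_norm = minmax_normalize(kl_scores, invert=True)
print(cc_norm)  # [0.0, 0.5, 1.0]
print(kl_norm)  # [0.0, 0.5, 1.0]
```

After this step every metric points in the same direction, which is what lets the enclosed area of a radar-style plot be read as overall quality.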