Data Limitations for Modeling Top-Down Effects on Drivers' Attention

Iuliia Kotseruba, John K. Tsotsos

TL;DR

This paper addresses the data limitations that hinder modeling of top-down effects on drivers' attention by reannotating four public gaze datasets (DR(eye)VE, BDD-A, MAAD, LBW) with driving-task and context labels and by analyzing their data collection and processing pipelines. It shows that bottom-up gaze models struggle to capture action- and context-driven attention because of limited spatial and temporal context and inconsistent gaze ground truth, and it links these model shortcomings to data quality rather than data quantity alone. The authors provide concrete recommendations for data collection and annotation to enable explicit top-down modeling, and release annotations and code to support reproducibility. The work highlights the importance of ecologically valid, richly annotated datasets for advancing explainable driver gaze prediction and related perception tasks.

Abstract

Driving is a visuomotor task, i.e., there is a connection between what drivers see and what they do. While some models of drivers' gaze account for top-down effects of drivers' actions, the majority learn only bottom-up correlations between human gaze and driving footage. The crux of the problem is the lack of public data with annotations that could be used to train top-down models and evaluate how well models of any kind capture effects of task on attention. As a result, top-down models are trained and evaluated on private data, and public benchmarks measure only the overall fit to human data. In this paper, we focus on data limitations by examining four large-scale public datasets, DR(eye)VE, BDD-A, MAAD, and LBW, used to train and evaluate algorithms for drivers' gaze prediction. We define a set of driving tasks (lateral and longitudinal maneuvers) and context elements (intersections and right-of-way) known to affect drivers' attention, augment the datasets with annotations based on these definitions, and analyze the characteristics of data recording and processing pipelines w.r.t. capturing what the drivers see and do. In sum, the contributions of this work are: 1) quantifying biases of the public datasets, 2) examining performance of the SOTA bottom-up models on subsets of the data involving non-trivial drivers' actions, 3) linking shortcomings of the bottom-up models to data limitations, and 4) providing recommendations for future data collection and processing. The new annotations and code for reproducing the results are available at https://github.com/ykotseruba/SCOUT.

Paper Structure (17 sections, 10 figures, 4 tables)

Figures (10)

  • Figure 1: An example from DR(eye)VE, showing the view of the roundabout from the camera mounted on the driver's head (left) and from the vehicle forward-facing camera (right). Note that the forward camera does not capture what the driver can see from the side window by turning their head.
  • Figure 2: BDD-A videos with quality issues: reflection on the windshield (left) and tilted and obstructed view (right).
  • Figure 3: Overexposed (left) and underexposed (right) frames in LBW.
  • Figure 4: Gaze recorded on-road (from DR(eye)VE) and in-lab (from MAAD) for right-of-way and yielding episodes.
  • Figure 5: Example of gaze recorded on-road and in-lab aggregated over one yielding scenario.
  • ...and 5 more figures