SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction

Iuliia Kotseruba, John K. Tsotsos

TL;DR

This work tackles the problem of predicting drivers' gaze during safety-critical maneuvers without relying on privileged top-down annotations. It introduces SCOUT+, a map- and route-aware model that fuses visual scene features with GPS-derived map information via cross-attention, using a Video Swin Transformer for visuals and a lightweight CNN for maps. Evaluations on DR(eye)VE and BDD-A show that map-informed SCOUT+ delivers competitive performance relative to top-down SCOUT and outperforms bottom-up baselines, with notable gains in lateral actions and intersections. The approach demonstrates practical effectiveness by deriving context from publicly available GPS data and maps, though it is limited by map fidelity and could benefit from richer road-structure and traffic-relationship data in future work.

Abstract

Accurate prediction of drivers' gaze is an important component of vision-based driver monitoring and assistive systems. Of particular interest are safety-critical episodes, such as performing maneuvers or crossing intersections. In such scenarios, drivers' gaze distribution changes significantly and becomes difficult to predict, especially if task and context information is represented implicitly, as is common in many state-of-the-art models. However, explicit modeling of top-down factors affecting drivers' attention often requires additional information and annotations that may not be readily available. In this paper, we address the challenge of effectively modeling task and context with common sources of data for use in practical systems. To this end, we introduce SCOUT+, a task- and context-aware model for drivers' gaze prediction, which leverages route and map information inferred from commonly available GPS data. We evaluate our model on two datasets, DR(eye)VE and BDD-A, and demonstrate that using maps improves results compared to bottom-up models and reaches performance comparable to the top-down model SCOUT, which relies on privileged ground truth information. Code is available at https://github.com/ykotseruba/SCOUT.

Paper Structure (13 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Example of the map and route inferred from DR(eye)VE GPS data. Left: street network with the route traveled by the vehicle overlaid. Right: enlarged portion of the map showing how the original noisy GPS coordinates (red) are matched to the street network (green). Map colors are inverted here for readability. For use in training and inference, the street network and route are rendered as white lines on a black background.
  • Figure 2: Diagram of the SCOUT+ architecture. The scene encoder takes in a set of images and outputs 3D spatio-temporal features. To obtain map and route features, the map generation module extracts the map and route information from the GPS data using the OpenStreetMap (OSM) API. Then, a patch of the map corresponding to the location in the visual input is cropped and fed into the map encoder, a shallow CNN that produces 2D map and route features. The scene-map transformer applies cross-attention (CA) to fuse map and visual features. The decoder receives input either directly from the encoder or from the CA blocks (shown as dashed arrows) and gradually mixes and upscales features to obtain the final saliency map.
  • Figure 3: Qualitative samples showing performance of SCOUT+ on challenging scenarios near intersections in DR(eye)VE (top 3 rows) and BDD-A (bottom 3 rows) against the bottom-up model ViNet and against SCOUT, which uses privileged task information.
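
The map matching mentioned in the Figure 1 caption, where noisy GPS coordinates are aligned to the street network, can be illustrated with a minimal sketch. The captions do not say which matching method is used, so the code below is only an assumption-laden simplification: it snaps each point independently to the closest street segment, whereas practical map matchers usually also take trajectory continuity into account. The function names `snap_to_segment` and `match_to_network`, the toy network, and the use of planar (already projected) coordinates instead of raw latitude and longitude are all hypothetical choices made for illustration.

```python
import math


def snap_to_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (planar coordinates)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return a  # degenerate segment: both endpoints coincide
    # Parameter of the orthogonal projection of p onto the line through a and b,
    # clamped to [0, 1] so that the result stays on the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def match_to_network(points, segments):
    """Snap every point to the nearest location on any segment.

    Assumption: each point is matched independently (nearest-segment
    projection); this is a simplification, not the method used by SCOUT+.
    """
    matched = []
    for p in points:
        candidates = (snap_to_segment(p, a, b) for a, b in segments)
        matched.append(min(candidates, key=lambda q: math.dist(p, q)))
    return matched


# Hypothetical toy street network: two segments forming an L-shaped corner.
segments = [((0.0, 0.0), (10.0, 0.0)), ((10.0, 0.0), (10.0, 10.0))]
# Hypothetical noisy GPS fixes scattered around the two streets.
noisy = [(2.0, 0.5), (9.0, -0.4), (10.6, 5.0)]
matched = match_to_network(noisy, segments)
```

In this toy example the first two fixes land on the horizontal street and the third on the vertical one. Independent snapping like this can jump between parallel or crossing streets near intersections, which is one reason a real pipeline would likely rely on a trajectory-aware matcher.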