Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction

M. Eren Akbiyik; Nedko Savov; Danda Pani Paudel; Nikola Popovic; Christian Vater; Otmar Hilliges; Luc Van Gool; Xi Wang

TL;DR

This work tackles ego-trajectory prediction by integrating the driver's field-of-view with the surrounding environment. It introduces RouteFormer, a multimodal network that fuses past motion, scene data, and driver gaze to forecast future ego-motion, aided by a future-discounted loss and auxiliary supervision. A new Path Complexity Index (PCI) quantifies scenario difficulty, and the GEM dataset provides synchronized gaze, FOV, and GPS data in urban settings to evaluate human-centric prediction models. Empirical results on GEM and DR(eye)VE show RouteFormer surpassing state-of-the-art methods, with substantial gains when incorporating driver FOV, especially in complex, high-PCI situations. The work establishes a human-centric benchmark and demonstrates practical potential for safer driver-assistance systems.
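The summary mentions a future-discounted loss but does not give its form. As a rough illustration only, one plausible reading is a per-step displacement error weighted by a geometric discount, so that near-future waypoints count more than distant ones. The function name `future_discounted_loss`, the discount parameter `gamma`, and its default value are assumptions made for this sketch, not details taken from the paper.

```python
import math

def future_discounted_loss(pred, target, gamma=0.97):
    """Discount-weighted mean displacement between two trajectories.

    ASSUMPTION: geometric weights gamma**t over the prediction horizon.
    The paper's actual loss may differ in both form and constants.
    pred, target: equally long sequences of (x, y) waypoints.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    # Step t gets weight gamma**t, so later steps contribute less.
    weights = [gamma ** t for t in range(len(pred))]
    errors = [math.dist(p, q) for p, q in zip(pred, target)]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)
```

With `gamma=1.0` every step is weighted equally and the value reduces to the plain mean displacement; smaller values shift the emphasis toward the first predicted steps.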

Abstract

Understanding drivers' decision-making is crucial for road safety. Although predicting the ego-vehicle's path is valuable for driver-assistance systems, existing methods mainly focus on external factors like other vehicles' motions, often neglecting the driver's attention and intent. To address this gap, we infer the ego-trajectory by integrating the driver's gaze and the surrounding scene. We introduce RouteFormer, a novel multimodal ego-trajectory prediction network combining GPS data, environmental context, and the driver's field-of-view, comprising first-person video and gaze fixations. We also present the Path Complexity Index (PCI), a new metric for trajectory complexity that enables a more nuanced evaluation of challenging scenarios. To tackle data scarcity and enhance diversity, we introduce GEM, a comprehensive dataset of urban driving scenarios enriched with synchronized driver field-of-view and gaze data. Extensive evaluations on GEM and DR(eye)VE demonstrate that RouteFormer significantly outperforms state-of-the-art methods, achieving notable improvements in prediction accuracy across diverse conditions. Ablation studies reveal that incorporating driver field-of-view data yields significantly better average displacement error, especially in challenging scenarios with high PCI scores, underscoring the importance of modeling driver attention. All data and code are available at https://meakbiyik.github.io/routeformer.
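The abstract reports results as average displacement error (ADE). For readers unfamiliar with the metric, the sketch below computes it in its standard form: the mean Euclidean distance between predicted and ground-truth waypoints over the prediction horizon. The function name and the 2D `(x, y)` waypoint layout are illustrative assumptions, and the paper's evaluation code may handle units or coordinate frames differently.

```python
import math

def average_displacement_error(pred, target):
    """Mean Euclidean distance between matching waypoints.

    pred, target: equally long sequences of (x, y) positions,
    assumed here to share one metric coordinate frame.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(pred)

# Example: per-step errors are 0, 1, and 2, so the mean is 1.0.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
target = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(average_displacement_error(pred, target))
```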

Paper Structure (40 sections, 9 equations, 16 figures, 13 tables)

Introduction
Related Work
Ego-Motion Prediction with Driver Field-of-View
Task Definition
RouteFormer Architecture
Training Objectives
Implementation Details
Path Complexity Index
Gaze-assisted Ego Motion (GEM) Dataset
Hardware Setup
Data Collection
Comparison with existing datasets
Experiments
Setup
Evaluation of RouteFormer
...and 25 more sections

Figures (16)

Figure 1: RouteFormer framework. Using the past GPS and the scene together with driver field-of-view, we predict the future ego-trajectory and visual features in driving with a novel loss scheme.
Figure 2: Framework details. The multimodal architecture fuses FOV, scene, and motion data for ego-trajectory forecasting. (a) The videos are encoded frame-wise by the scene encoder $E_S$, which uses a pre-trained vision backbone. FOV data is then encoded via a cross-modal transformer. The resulting tensors, all in the image feature domain, are stacked across time, self-attended, and concatenated with motion features for forecasting. (b) Alongside the trajectory, RouteFormer predicts features of the visual modalities; these feature predictions serve as auxiliary losses for regularization.
Figure 3: Generated trajectories and their values. The black paths to the left are inputs, and the colored paths are targets generated exhaustively by varying the speed, turning angle, and turn curvature.
Figure 4: Example trajectories with varying PCI. White is the input and red is the target trajectory.
Figure 5: Qualitative examples. RouteFormer shows higher confidence in sharp turns than other SOTA models using gaze, which tend to prefer the mean of the past trajectory (left). The turn confidence is lower when no driver FOV information is used (right).
...and 11 more figures
