Driver Attention Tracking and Analysis

Dat Viet Thanh Nguyen; Anh Tran; Hoai Nam Vu; Cuong Pham; Minh Hoai

TL;DR

The paper tackles driver gaze estimation in dynamic 3D traffic scenes using two ordinary cameras: one on the windshield facing the driver and one on the dashboard facing the road. It introduces DPEN, a two-branch network with a camera-calibration module that embeds the spatial relationship between the driver and the cameras; the whole network is trained end to end to localize the gaze point in the scene image. The authors also collect and release DPoG, a large in-situ dataset with synchronized face, scene, and gaze data, and report its scene and gaze statistics. DPEN achieves a mean gaze error of 29.69 pixels (eye-angle < 3 degrees) on the $1280{\times}720$ scene frame and outperforms the baselines, with ablations confirming the importance of the calibration module and the scene input. The intended application is practical driver monitoring for safety and attention analysis.

Abstract

We propose a novel method to estimate a driver's points-of-gaze using a pair of ordinary cameras mounted on the windshield and dashboard of a car. This is a challenging problem due to the dynamics of traffic environments with 3D scenes of unknown depths. This problem is further complicated by the volatile distance between the driver and the camera system. To tackle these challenges, we develop a novel convolutional network that simultaneously analyzes the image of the scene and the image of the driver's face. This network has a camera calibration module that can compute an embedding vector that represents the spatial configuration between the driver and the camera system. This calibration module improves the overall network's performance, which can be jointly trained end to end. We also address the lack of annotated data for training and evaluation by introducing a large-scale driving dataset with point-of-gaze annotations. This is an in situ dataset of real driving sessions in an urban city, containing synchronized images of the driving scene as well as the face and gaze of the driver. Experiments on this dataset show that the proposed method outperforms various baseline methods, having the mean prediction error of 29.69 pixels, which is relatively small compared to the $1280{\times}720$ resolution of the scene camera.
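To put the reported error in context, the sketch below computes a mean pixel error and expresses it as a fraction of the frame diagonal. It assumes the error is the Euclidean distance between predicted and annotated gaze points, which the abstract does not state explicitly; the function names and the toy points are hypothetical and are not taken from the paper.

```python
import math

def mean_pixel_error(predicted, annotated):
    """Mean Euclidean distance, in pixels, between paired 2D points.

    Assumption: the error metric is a plain Euclidean distance per frame.
    """
    if len(predicted) != len(annotated) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    distances = [math.dist(p, a) for p, a in zip(predicted, annotated)]
    return sum(distances) / len(distances)

def fraction_of_diagonal(error_px, width=1280, height=720):
    """Express a pixel error as a fraction of the frame diagonal."""
    return error_px / math.hypot(width, height)

# Toy points (hypothetical): per-frame errors are 5 px and 10 px.
predicted = [(100.0, 200.0), (640.0, 360.0)]
annotated = [(103.0, 204.0), (640.0, 350.0)]
toy_error = mean_pixel_error(predicted, annotated)

# The reported 29.69 px relative to the 1280x720 scene frame.
reported_share = fraction_of_diagonal(29.69)
print(f"toy mean error: {toy_error:.2f} px")
print(f"29.69 px is {100 * reported_share:.2f}% of the frame diagonal")
```

Under this assumption, 29.69 pixels is roughly 2% of the diagonal of a $1280{\times}720$ frame.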

Paper Structure (10 sections, 4 equations, 8 figures, 2 tables)

Introduction
Related Work
Drivers' Points-of-Gaze Dataset
Data collection and annotation
Scene and gaze statistics
Drivers' Points-of-Gaze Estimation Network
Network Architecture and Processing Pipeline
Training procedure
Experiments
Conclusions

Figures (8)

Figure 1: Positions of the GoPro cameras used for data collection. One camera was attached to the windshield to capture the driver's face and head movements. The other was placed on the dashboard, pointing out at the road.
Figure 2: Matching result using RANSAC-Flow (Shen et al., 2020). RANSAC-Flow is used to warp the gaze frame to the scene frame and transfer the gaze point (green dot) from the gaze frame to the scene frame (red dot). On an annotated dataset of 589 instances, the median and mean errors are 9.2 and 25.1 pixels, which are relatively small compared to the $1280{\times}720$ size of the scene frame. The top right corner of the scene frame shows a circle with a radius of 25.1 pixels.
Figure 3: Scene statistics. (a): the percentages of images from the scene camera containing objects in each semantic class. Almost all images contain road, sky, building, vegetation, and car. Bicycles and motorcycles also appear very often. The least frequent classes are train and traffic_light. (b): the percentages of scene-image pixels belonging to each semantic class. The majority of the pixels belong to road, sky, building, and vegetation.
Figure 4: Semantics of gaze pixels and all pixels. For each semantic class, the red bar shows the percentage of fixation points that fall on the class, and the blue bar shows the percentage of pixels in the scene camera that belong to the class. The subplot in the middle is a zoomed-in view of the classes with the smallest percentages of occurrence.
Figure 5: Architecture of the proposed Drivers' Points-of-Gaze Estimation Network (DPEN).
...and 3 more figures
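The comparison described in the Figure 4 caption (share of fixations per semantic class versus share of pixels per class) can be sketched as follows. This is a minimal illustration, assuming a per-pixel label map is already available from some semantic segmentation step; the function names, the tiny label map, and the fixation coordinates are all hypothetical and do not come from the dataset.

```python
from collections import Counter

def class_percentages(labels):
    """Percentage of occurrences of each label in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

def gaze_vs_pixel_share(label_map, fixations):
    """Return {class: (gaze %, pixel %)} for one labelled scene frame.

    label_map: rows of class names, indexed as label_map[y][x] (assumed layout).
    fixations: (x, y) gaze points in scene-frame pixel coordinates.
    """
    pixel_share = class_percentages(label for row in label_map for label in row)
    gaze_share = class_percentages(label_map[y][x] for x, y in fixations)
    return {
        name: (gaze_share.get(name, 0.0), pixel_share[name])
        for name in pixel_share
    }

# Hypothetical 4x2 label map and four fixations, for illustration only.
label_map = [
    ["sky", "sky", "sky", "sky"],
    ["road", "road", "car", "road"],
]
fixations = [(2, 1), (2, 1), (0, 1), (1, 1)]
shares = gaze_vs_pixel_share(label_map, fixations)
for name, (gaze_pct, pixel_pct) in sorted(shares.items()):
    print(f"{name}: gaze {gaze_pct:.1f}%, pixels {pixel_pct:.1f}%")
```

In this toy frame, the car class covers 12.5% of the pixels but receives half of the fixations, which is the kind of gap between the red and blue bars that the caption describes.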
