Driver Attention Tracking and Analysis
Dat Viet Thanh Nguyen, Anh Tran, Hoai Nam Vu, Cuong Pham, Minh Hoai
TL;DR
The paper tackles driver gaze estimation in dynamic 3D traffic scenes using two ordinary cameras mounted on the windshield and dashboard. It introduces DPEN, a two-branch network with a camera-calibration module that embeds the spatial relationship between the driver and the cameras, so the whole model can be trained end to end to localize gaze in the scene image. A large in-situ dataset (DPoG) with synchronized face, scene, and gaze data is collected and released for training and evaluation. DPEN achieves a mean gaze error of 29.69 pixels (eye-angle < 3 degrees) and outperforms the baselines, and ablations confirm the importance of both the calibration module and the scene input. The work targets practical, scalable driver monitoring for safety and attention analytics.
Abstract
We propose a novel method to estimate a driver's points-of-gaze using a pair of ordinary cameras mounted on the windshield and dashboard of a car. This is a challenging problem due to the dynamics of traffic environments with 3D scenes of unknown depths. The problem is further complicated by the varying distance between the driver and the camera system. To tackle these challenges, we develop a novel convolutional network that simultaneously analyzes the image of the scene and the image of the driver's face. This network has a camera calibration module that computes an embedding vector representing the spatial configuration between the driver and the camera system. The calibration module improves the overall network's performance, and the whole network can be trained jointly end to end. We also address the lack of annotated data for training and evaluation by introducing a large-scale driving dataset with point-of-gaze annotations. This is an in situ dataset of real driving sessions in a city, containing synchronized images of the driving scene as well as the face and gaze of the driver. Experiments on this dataset show that the proposed method outperforms various baseline methods, achieving a mean prediction error of 29.69 pixels, which is relatively small compared to the $1280{\times}720$ resolution of the scene camera.
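To put the reported error in context, the following minimal sketch shows one way a mean pixel error could be computed and then related to the $1280{\times}720$ frame. It assumes the error is the Euclidean distance between predicted and annotated gaze points, which the abstract does not state explicitly. The function names and the sample points are hypothetical and are not taken from the paper.

```python
import math


def mean_pixel_error(predicted, actual):
    """Mean Euclidean distance, in pixels, between paired (x, y) points.

    Assumption: the error metric is Euclidean distance in image space.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.dist(p, a) for p, a in zip(predicted, actual))
    return total / len(predicted)


def relative_to_frame(error_px, width=1280, height=720):
    """Express a pixel error as fractions of the frame width, height, and diagonal."""
    diagonal = math.hypot(width, height)
    return {
        "width": error_px / width,
        "height": error_px / height,
        "diagonal": error_px / diagonal,
    }


# Hypothetical predictions and annotated gaze points, for illustration only.
predicted = [(640.0, 360.0), (100.0, 200.0)]
actual = [(643.0, 364.0), (100.0, 210.0)]
example_error = mean_pixel_error(predicted, actual)  # distances 5 and 10, mean 7.5

# The error reported in the abstract, relative to the scene-camera frame.
reported = relative_to_frame(29.69)
```

Under these assumptions, 29.69 pixels is about 2.3% of the frame width, 4.1% of the frame height, and 2.0% of the frame diagonal.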
