What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation

Yihua Cheng, Yaning Zhu, Zongji Wang, Hongquan Hao, Yongwei Liu, Shiqing Cheng, Xi Wang, Hyung Jin Chang

TL;DR

This work introduces IVGaze, a pioneering in-vehicle gaze dataset, together with a new vision-based solution for in-vehicle gaze collection that uses a refined gaze target calibration method to tackle annotation challenges. It also explores a novel strategy for gaze zone classification by extending GazeDPTR.

Abstract

A driver's eye gaze holds a wealth of cognitive and intentional cues crucial for intelligent vehicles. Despite its significance, research on in-vehicle gaze estimation remains limited due to the scarcity of comprehensive and well-annotated datasets in real driving scenarios. In this paper, we present three novel elements to advance in-vehicle gaze research. First, we introduce IVGaze, a pioneering dataset capturing in-vehicle gaze, collected from 125 subjects and covering a large range of gaze and head poses within vehicles. Conventional gaze collection systems are inadequate for in-vehicle use. For this dataset, we propose a new vision-based solution for in-vehicle gaze collection, introducing a refined gaze target calibration method to tackle annotation challenges. Second, our research focuses on in-vehicle gaze estimation leveraging IVGaze. In-vehicle face images often suffer from low resolution, prompting our introduction of a gaze pyramid transformer that leverages transformer-based multi-level feature integration. Expanding upon this, we introduce the dual-stream gaze pyramid transformer (GazeDPTR). Employing perspective transformation, we rotate virtual cameras to normalize images, and we use the camera pose to merge normalized and original images for accurate gaze estimation. GazeDPTR shows state-of-the-art performance on the IVGaze dataset. Third, we explore a novel strategy for gaze zone classification by extending GazeDPTR. We newly define a foundational tri-plane and project gaze onto these planes. Leveraging both positional features from the projection points and visual attributes from images, we achieve superior performance compared to relying solely on visual features, substantiating the advantage of gaze estimation. Our project is available at https://yihua.zone/work/ivgaze.
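
The tri-plane projection mentioned in the abstract can be illustrated with a plain ray-plane intersection. The sketch below is only an assumption-laden illustration, not the authors' implementation: the plane positions in `HYPOTHETICAL_TRI_PLANE`, the function names, and the choice of camera coordinates in metres are all made up for this example, since the abstract does not specify how the tri-plane is defined.

```python
import numpy as np


def project_gaze_onto_plane(origin, direction, plane_point, plane_normal, eps=1e-8):
    """Intersect the gaze ray origin + t * direction (t >= 0) with a plane.

    Returns the 3D intersection point, or None when the ray is parallel
    to the plane or the plane lies behind the gaze origin.
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    plane_point = np.asarray(plane_point, dtype=float)
    plane_normal = np.asarray(plane_normal, dtype=float)

    denom = float(np.dot(plane_normal, direction))
    if abs(denom) < eps:  # gaze runs parallel to the plane
        return None
    t = float(np.dot(plane_normal, plane_point - origin)) / denom
    if t < 0:  # intersection would be behind the eye
        return None
    return origin + t * direction


# Hypothetical tri-plane as (point on plane, plane normal) pairs.
# These placements are illustrative assumptions, not values from the paper.
HYPOTHETICAL_TRI_PLANE = (
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),  # frontal plane z = 0
    ((0.8, 0.0, 0.0), (1.0, 0.0, 0.0)),  # side plane x = 0.8
    ((0.0, 0.4, 0.0), (0.0, 1.0, 0.0)),  # lower plane y = 0.4
)


def project_gaze_onto_tri_plane(origin, direction, planes=HYPOTHETICAL_TRI_PLANE):
    """Project one gaze ray onto every plane; entries are points or None."""
    return [project_gaze_onto_plane(origin, direction, p, n) for p, n in planes]


# Example: an eye 0.6 m in front of the camera, looking toward it and off to one side.
points = project_gaze_onto_tri_plane((0.0, 0.0, 0.6), (1.0, 0.5, -1.0))
```

In a classifier along the lines the abstract describes, the resulting intersection points would supply the positional features that are combined with visual features; how they are encoded is not covered by this sketch.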

Paper Structure (31 sections, 11 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: In-vehicle gaze estimation illustration. The driver's gaze direction is estimated based on the facial images captured by the camera behind the steering wheel.
  • Figure 2: We construct a vision-based in-vehicle gaze collection system comprising a DMS camera, a depth camera, and strategically placed gaze targets, as depicted in (c). The DMS camera is positioned behind the steering wheel to capture drivers' facial appearances, while the gaze targets, positioned beyond the DMS camera's field-of-view (FoV), such as on the windshield, remain unobserved. The depth camera, utilized for calibration purposes, is temporarily installed for capturing gaze target positions in 3D with respect to its own coordinates, and it is removed during data collection. To facilitate the calibration of the depth camera's pose relative to the DMS camera, we propose employing a transparent chessboard, which is placed between the two cameras.
  • Figure 3: Our dataset is collected using IR cameras in the vehicle environment. (a) We present image samples of IVGaze, highlighting the challenges posed by realistic in-vehicle conditions, including cases with sunglasses and reflections in glasses. (b) We categorize the image count based on their mean pixel value, showing the diversity of illumination conditions. (c) The image count is analyzed based on face accessories including glasses, sunglasses, and masks.
  • Figure 4: We show the distribution of data for gaze (left) and head movements (right). Brighter regions denote higher data density.
  • Figure 5: GazeDPTR directly crops the face to obtain the original image and rotates virtual cameras via perspective transformation to obtain the normalized image. It builds a dual-stream network that extracts features from the two images using GazePTR, which integrates multi-level features via transformers. To further merge the features from the two streams, we use a transformer in which the camera pose serves as the positional feature. We define the original camera pose as $\boldsymbol{\rm{C}}=\mathrm{diag}(1,1,1)$ and the camera pose in the normalization space as $\boldsymbol{\rm{RC}}$. We also extend the network for gaze zone classification. We define a tri-plane and project gaze onto it. We extract positional features from the three intersection points via a transformer. We also extract visual features from images and predict the gaze zone based on both visual and positional features. The whole network is trained in an end-to-end manner.
  • ...and 5 more figures
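
The virtual-camera rotation described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it follows a common formulation of gaze data normalization in which the virtual camera is rotated so that its optical axis passes through the face center, and the function names, the intrinsics `K_orig` and `K_norm`, and the use of the head's x-axis to remove roll are choices made for this example. With the original pose $\boldsymbol{\rm{C}}=\mathrm{diag}(1,1,1)$, the normalized pose $\boldsymbol{\rm{RC}}$ reduces to the rotation `R` computed here.

```python
import numpy as np


def normalization_rotation(face_center, head_x_axis):
    """Rotation R of a virtual camera that looks straight at the face center.

    face_center: 3D face position in the original camera frame.
    head_x_axis: x-axis of the head pose in the same frame, used to cancel roll.
    Assumes head_x_axis is not parallel to the viewing direction.
    """
    face_center = np.asarray(face_center, dtype=float)
    head_x_axis = np.asarray(head_x_axis, dtype=float)

    forward = face_center / np.linalg.norm(face_center)  # new z-axis
    down = np.cross(forward, head_x_axis)                # new y-axis
    down /= np.linalg.norm(down)
    right = np.cross(down, forward)                      # new x-axis
    right /= np.linalg.norm(right)
    return np.stack([right, down, forward])              # rows are the new axes


def normalization_homography(K_orig, K_norm, R):
    """Perspective transformation mapping original pixels to the rotated virtual view.

    K_orig and K_norm are 3x3 intrinsic matrices of the real and the virtual
    camera (hypothetical parameters for this sketch; distance scaling is omitted).
    """
    return K_norm @ R @ np.linalg.inv(K_orig)


# Example with made-up intrinsics and a face slightly off the optical axis.
K_orig = np.array([[900.0, 0.0, 640.0], [0.0, 900.0, 360.0], [0.0, 0.0, 1.0]])
K_norm = np.array([[960.0, 0.0, 112.0], [0.0, 960.0, 112.0], [0.0, 0.0, 1.0]])
R = normalization_rotation([0.2, -0.1, 0.6], [1.0, 0.0, 0.0])
H = normalization_homography(K_orig, K_norm, R)
```

The matrix `H` could then be applied to an image with a perspective-warp routine such as OpenCV's `cv2.warpPerspective`, and a gaze vector `g` in the original frame maps to `R @ g` in the normalized frame.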