Egocentric Gaze Estimation via Neck-Mounted Camera

Haoyu Huang, Yoichi Sato

TL;DR

This work introduces neck-mounted view gaze estimation as a new egocentric gaze task. To address the lack of neck-mounted data, the authors collect the first dataset for the task: approximately 4 hours of synchronized head-mounted gaze and neck-mounted video from 8 participants. They benchmark a transformer-based gaze model (GLC) on this domain and propose two task-specific extensions: an auxiliary in-view classification task, which predicts whether the gaze falls inside the camera's field of view, and a multi-view co-learning scheme that aligns latent features conditioned on the relative camera rotation. The auxiliary in-view classifier yields modest gains over direct fine-tuning, while multi-view co-learning does not improve performance. The study provides a dataset, an analysis of neck-mounted gaze characteristics (notably higher out-of-view rates and a different center bias), and directions for future work, including scaling up the data and developing neck-tailored architectures with broader device support.

Abstract

This paper introduces neck-mounted view gaze estimation, a new task that estimates the user's gaze from the perspective of a neck-mounted camera. Prior work on egocentric gaze estimation, which predicts the device wearer's gaze location within the camera's field of view, focuses mainly on head-mounted cameras, while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video recorded from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
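
The first extension combines a gaze heatmap loss with a binary in-view classification loss. The following is a minimal sketch of how such a joint objective might look, not the authors' implementation. It assumes a KL-divergence heatmap loss, a binary cross-entropy classification loss, a hypothetical weight `weight`, and that the heatmap term is skipped for out-of-view frames; none of these details are stated in the abstract.

```python
import math


def softmax(logits):
    """Turn flattened heatmap logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_heatmap_loss(pred_logits, target, eps=1e-8):
    """KL(target || softmax(pred_logits)) over a flattened heatmap.

    Assumption: the heatmap loss is a KL divergence; the source only
    calls it "the heatmap loss".
    """
    pred = softmax(pred_logits)
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, pred)
        if t > 0
    )


def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy (label 1 = gaze in view)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def joint_loss(pred_logits, target, in_view_logit, in_view_label, weight=0.5):
    """Heatmap loss plus weighted in-view classification loss.

    Assumptions: `weight` is a hypothetical hyperparameter, and the
    heatmap term is dropped when the gaze is out of view because no
    valid target heatmap exists for that frame.
    """
    cls = bce_with_logit(in_view_logit, in_view_label)
    heat = kl_heatmap_loss(pred_logits, target) if in_view_label == 1 else 0.0
    return heat + weight * cls
```

In this sketch, an out-of-view frame contributes only the classification term, so the classifier branch still receives a training signal on frames where the heatmap has no valid target.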

Paper Structure (17 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Architecture of the auxiliary in-view classification method. The gaze-in-bound classifier branches off the bottleneck feature of the original model and performs binary classification of whether the gaze lies inside the camera's field of view. The in-bound classification loss is backpropagated together with the heatmap loss.
  • Figure 2: Architecture of the multi-view co-learning method. Two GLC [lai2024glc] models receive footage and gaze ground truth from the head-view camera and the neck-view camera, respectively. The 3D projection module projects the bottleneck feature into a 3D latent vector field. The vector field is then rotated by the relative extrinsic rotation matrix, and an alignment loss is computed between the rotated features. The alignment loss is backpropagated together with the heatmap loss.
  • Figure 3: Collection methodology for the neck-mounted view gaze estimation dataset. The user wears an eye-tracker-equipped head-mounted camera and a neck-mounted camera, which record video simultaneously during collection. The gaze point in the head-view camera coordinate frame is then mapped into the neck-view camera coordinate frame.
  • Figure 4: The gaze annotation pipeline for our dataset. For each timestep, a pair of frames from the synchronized head-view and neck-view videos is fed into the VGGT [wang2025vggt] encoder. The CoTracker [karaev2024cotracker] tracking head then takes the resulting features and the gaze point in the head-view coordinate frame, and outputs the gaze point in the neck-view coordinate frame.
  • Figure 5: Distribution of the gaze point in head-view camera and neck-view camera.
  • ...and 3 more figures
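
The rotation-based alignment described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each latent feature is a list of 3D vectors, that the head-view vectors are rotated into the neck-view frame, and that the alignment loss is a mean squared error; the names `rotate`, `alignment_loss`, and `R_neck_from_head` are hypothetical.

```python
import math


def rotate(R, v):
    """Apply a 3x3 rotation matrix R (list of rows) to a 3D vector v."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def alignment_loss(head_vecs, neck_vecs, R_neck_from_head):
    """Mean squared error between rotated head-view and neck-view vectors.

    Assumptions: the rotation direction (head frame into neck frame) and
    the squared-error form are illustrative choices; the caption only
    says an alignment loss is computed between the rotated features.
    """
    total = 0.0
    for h, n in zip(head_vecs, neck_vecs):
        h_rot = rotate(R_neck_from_head, h)
        total += sum((a - b) ** 2 for a, b in zip(h_rot, n))
    return total / len(head_vecs)


def rot_x(theta):
    """Rotation matrix for an angle theta (radians) about the x-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


# Usage: a head-view latent vector along +y matches a neck-view vector
# along +z when the relative rotation is 90 degrees about the x-axis.
R = rot_x(math.pi / 2)
loss_aligned = alignment_loss([[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]], R)
```

The loss is near zero only when the two latent fields agree after accounting for the relative camera rotation, which is the geometric consistency the co-learning scheme tries to encourage.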