CrossGaze: A Strong Method for 3D Gaze Estimation in the Wild

Andy Cătrună, Adrian Cosma, Emilian Rădoi

TL;DR

CrossGaze introduces a dual-encoder architecture that processes a full-face image for global gaze cues and eye crops for local eye cues, merging them with a cross-attention module to predict the 3D gaze vector. Trained with a cosine loss and augmented with RandAugment and cutout, the model performs strongly on the Gaze360 dataset, achieving a mean angular error of 9.94° on Front 180° and 7.17° on Front Facing with pretrained face weights. Ablation studies confirm the value of multi-scale face features, pretraining on large-scale face data (e.g., VGGFace2), and explicit eye-feature fusion via cross-attention. The work provides a robust, transfer-friendly baseline for gaze estimation in real-world scenarios, with potential applications in HCI, driver assistance, and assistive technologies.

Abstract

Gaze estimation, the task of predicting where an individual is looking, is a critical problem with direct applications in areas such as human-computer interaction and virtual reality. Estimating gaze direction in unconstrained environments is difficult because many factors can obscure the face and eye regions. In this work, we propose CrossGaze, a strong baseline for gaze estimation that leverages recent developments in computer vision architectures and attention-based modules. Unlike previous approaches, our method does not require a specialised architecture; instead, it integrates already established models and adapts them for the task of 3D gaze estimation. This approach allows for seamless updates to the architecture, as any module can be replaced with a more powerful feature extractor. On the Gaze360 benchmark, our model surpasses several state-of-the-art methods, achieving a mean angular error of 9.94 degrees. Our proposed model serves as a strong foundation for future research and development in gaze estimation, paving the way for practical and accurate gaze prediction in real-world scenarios.
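
The summary reports results as a mean angular error in degrees and mentions training with a cosine loss. As a minimal sketch of how these two quantities relate, the snippet below computes both from 3D gaze vectors. The loss form `1 - cos(pred, target)` and all function names are assumptions made for illustration; the excerpt does not give the paper's exact formulation.

```python
import math


def _normalize(v):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("gaze vector must be non-zero")
    return [c / norm for c in v]


def cosine_similarity(pred, target):
    """Cosine of the angle between two gaze vectors, clamped to [-1, 1]."""
    p, t = _normalize(pred), _normalize(target)
    dot = sum(a * b for a, b in zip(p, t))
    return max(-1.0, min(1.0, dot))  # guard against floating-point drift


def cosine_loss(pred, target):
    """Assumed loss form: 0 when the vectors align, 2 when they are opposite."""
    return 1.0 - cosine_similarity(pred, target)


def angular_error_deg(pred, target):
    """Angle between predicted and ground-truth gaze, in degrees."""
    return math.degrees(math.acos(cosine_similarity(pred, target)))


def mean_angular_error_deg(preds, targets):
    """Average angular error over a set of predictions."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Under this assumed form, minimising the loss and minimising the angular error pull in the same direction, since the angle is the arccosine of the similarity the loss is built on.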

Paper Structure (7 sections, 4 equations, 1 figure, 4 tables)


Figures (1)

  • Figure 1: A high-level overview of the CrossGaze architecture. After the face is detected with a pretrained model, we process the features using a separate encoder for the face and for the eyes. The two resulting feature maps are processed using a cross-attention module to obtain the final gaze prediction.
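
To make the fusion step in Figure 1 more concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not the authors' implementation. It assumes that face features act as queries and eye features as keys and values, and it omits the learned projections, multiple heads, and the final regression layer; the caption does not specify these details.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: n_q vectors of dimension d (assumed here: face tokens)
    keys:    n_k vectors of dimension d (assumed here: eye tokens)
    values:  n_k vectors of dimension d_v (assumed here: eye tokens)
    Returns n_q vectors of dimension d_v.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    d_v = len(values[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return fused


# Hypothetical toy inputs: two face tokens attend over three eye tokens.
face_tokens = [[1.0, 0.0], [0.0, 1.0]]
eye_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_tokens = cross_attention(face_tokens, eye_tokens, eye_tokens)
```

Each output row is a convex combination of the eye tokens, weighted by how closely each one matches the corresponding face token. In a full model, a fused representation of this kind would then be mapped to the 3D gaze vector.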