GazeShift: Unsupervised Gaze Estimation and Dataset for VR

Gil Shapira, Ishay Goldin, Evgeny Artyomov, Donghoon Kim, Yosi Keller, Niv Zehngut

TL;DR

This work introduces VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants, and proposes GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data.

Abstract

Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity - particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze - the first large-scale off-axis gaze estimation dataset for VR - comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84-degree mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15-degree person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at https://github.com/gazeshift3/gazeshift.
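
The abstract reports errors in degrees after a lightweight few-shot calibration of frozen embeddings. The sketch below illustrates one way such a step could look; it is not the authors' calibration module. It assumes a ridge-regularized linear map from embeddings to (pitch, yaw) angles, the pitch/yaw-to-vector convention commonly used with MPIIGaze, and synthetic embeddings in place of real encoder outputs. The names `fit_calibration`, `apply_calibration`, and `mean_angular_error_deg` are hypothetical.

```python
import numpy as np

def angles_to_vector(pitch_yaw):
    """Convert (pitch, yaw) in radians to unit 3D gaze vectors.
    The sign convention is an assumption, not taken from the paper."""
    pitch, yaw = pitch_yaw[:, 0], pitch_yaw[:, 1]
    return np.stack([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ], axis=1)

def mean_angular_error_deg(pred, true):
    """Mean angle in degrees between predicted and true gaze directions."""
    a, b = angles_to_vector(pred), angles_to_vector(true)
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def fit_calibration(embeddings, gaze, ridge=1e-3):
    """Closed-form ridge regression from embeddings (plus bias) to gaze angles."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ gaze)

def apply_calibration(W, embeddings):
    """Predict (pitch, yaw) for new embeddings with a fitted map W."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ W

# Synthetic stand-in for frozen gaze embeddings: a noisy linear function of gaze.
rng = np.random.default_rng(0)
gaze = rng.uniform(-0.3, 0.3, size=(200, 2))            # radians
emb = gaze @ rng.normal(size=(2, 8)) + 0.01 * rng.normal(size=(200, 8))

W = fit_calibration(emb[:9], gaze[:9])                  # few-shot: 9 labeled samples
err = mean_angular_error_deg(apply_calibration(W, emb[9:]), gaze[9:])
print(f"held-out mean angular error: {err:.2f} deg")
```

The printed value reflects only the synthetic data and says nothing about the errors reported above. The sketch shows the workflow: the encoder stays fixed, and only a small per-user map is fit from a handful of labeled samples.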

Paper Structure (28 sections, 5 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: GazeShift architecture. The model is trained to reconstruct the appearance source image to match the gaze target, conditioned on the target's gaze embedding. A lightweight gaze encoder extracts the embedding, while a separate appearance encoder preserves spatial structure. Appearance tokens undergo self-attention and cross-attention (conditioned on the gaze embedding) before being decoded. Attention maps also guide a gaze-aware loss focused on gaze-relevant regions. At inference, only the gaze encoder and a lightweight calibration module are used to predict gaze.
  • Figure 2: Sample images from our off-axis VRGaze dataset (left), example images from on-axis OpenEDS2020 dataset (center), 2D gaze angle distribution of VRGaze data (right).
  • Figure 3: Disentanglement analysis of learned embeddings. Left: Appearance Perturbations — same gaze, varying appearance. Right: Gaze Perturbations — same appearance, varying gaze. Gaze embeddings vary primarily with gaze direction, while appearance embeddings remain stable, demonstrating effective disentanglement.
  • Figure 4: Appearance source frames and their associated self-attention maps. The model learns to focus on regions that exhibit the greatest differences between the source and target images, primarily the gaze-relevant areas around the iris.
  • Figure 5: Latent space interpolation and gaze redirection. We interpolate between two target gaze embeddings ($g_{t_1} \rightarrow g_{t_2}$, top green boxes) while keeping the appearance embeddings ($A_{s_1}, A_{s_2}, A_{s_3}$, left blue boxes) fixed for each row. The smooth eye movement between the gaze targets, combined with the preservation of the source images' unique appearance, visually confirms effective disentanglement and a continuous latent manifold.
  • ...and 8 more figures
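
The interpolation described in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions: the caption does not say which interpolation scheme is used, so linear interpolation is assumed here, and the embedding sizes and the names `interpolate_gaze` and `redirection_grid` are hypothetical. The decoder that would turn each (appearance, gaze) pair into an image is not modeled.

```python
import numpy as np

def interpolate_gaze(g_start, g_end, steps):
    """Linearly interpolate between two gaze embeddings, endpoints included."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alphas) * g_start[None, :] + alphas * g_end[None, :]

def redirection_grid(appearance_embeddings, g_start, g_end, steps):
    """Pair each fixed appearance embedding (one row per source image)
    with every interpolated gaze embedding (one column per step)."""
    gaze_path = interpolate_gaze(g_start, g_end, steps)
    return [[(a, g) for g in gaze_path] for a in appearance_embeddings]

# Random stand-ins for encoder outputs; the dimensions are arbitrary.
rng = np.random.default_rng(0)
g_t1, g_t2 = rng.normal(size=8), rng.normal(size=8)     # two target gaze embeddings
appearances = rng.normal(size=(3, 32))                  # A_s1, A_s2, A_s3

grid = redirection_grid(appearances, g_t1, g_t2, steps=5)
print(len(grid), len(grid[0]))                          # rows, columns
```

Each row of `grid` holds one appearance embedding fixed while the gaze embedding moves from `g_t1` to `g_t2`, mirroring the row and column layout the caption describes.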