3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation

Yihua Cheng, Hengfei Wang, Zhongqun Zhang, Yang Yue, Bo Eun Kim, Feng Lu, Hyung Jin Chang

TL;DR

This work tackles cross-task adaptation in gaze estimation: a pre-trained 3D gaze model is adapted to predict 2D gaze on unseen devices from only a few labeled samples. It introduces a physics-based differentiable projection module with learnable screen pose parameters $\mathbf{r} \in \mathbb{R}^3$ and $\mathbf{t} \in \mathbb{R}^3$ that maps 3D gaze to 2D and integrates into existing 3D models without changing their architecture. A dynamic pseudo-labeling strategy produces 2D labels for flipped images by reversing the projection into 3D space and performing the flip there; a learned coordinate transformation $\mathcal{T}$ (computed via SVD) compensates for the misalignment between that space and the camera coordinate system as training proceeds. The method also minimizes uncertainty across jittered images for robustness. It is evaluated on MPIIGaze, EVE, and GazeCapture, showing consistent improvements over state-of-the-art projection-based approaches under few-shot settings, and demonstrating practical potential for real-world, device-agnostic gaze estimation without screen calibration or access to source data. Overall, the paper provides a principled, interpretable pathway to leverage 3D gaze priors for rapid, robust 2D gaze estimation across diverse devices.

Abstract

3D and 2D gaze estimation share the fundamental objective of capturing eye movements but are traditionally treated as two distinct research domains. In this paper, we introduce a novel cross-task few-shot 2D gaze estimation approach, aiming to adapt a pre-trained 3D gaze estimation network for 2D gaze prediction on unseen devices using only a few training images. This task is highly challenging due to the domain gap between 3D and 2D gaze, unknown screen poses, and limited training data. To address these challenges, we propose a novel framework that bridges the gap between 3D and 2D gaze. Our framework contains a physics-based differentiable projection module with learnable parameters to model screen poses and project 3D gaze into 2D gaze. The framework is fully differentiable and can integrate into existing 3D gaze networks without modifying their original architecture. Additionally, we introduce a dynamic pseudo-labelling strategy for flipped images, which is particularly challenging for 2D labels due to unknown screen poses. To overcome this, we reverse the projection process by converting 2D labels to 3D space, where flipping is performed. Notably, this 3D space is not aligned with the camera coordinate system, so we learn a dynamic transformation matrix to compensate for this misalignment. We evaluate our method on MPIIGaze, EVE, and GazeCapture datasets, collected respectively on laptops, desktop computers, and mobile devices. The superior performance highlights the effectiveness of our approach, and demonstrates its strong potential for real-world applications.
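
To make the projection step in the abstract concrete, here is a minimal geometric sketch of mapping a 3D gaze ray to a 2D point on a screen. It is not the authors' implementation. The function names `rodrigues` and `project_gaze_to_screen` are hypothetical, and the sketch assumes that the screen lies in the $z = 0$ plane of its own coordinate frame, that $\mathbf{r}$ (a rotation vector) and $\mathbf{t}$ map screen coordinates to camera coordinates, and that all lengths share one unit such as millimetres. NumPy is used only to show the geometry; in a trainable setup the same operations would be written in an autograd framework so that $\mathbf{r}$ and $\mathbf{t}$ receive gradients.

```python
import numpy as np


def rodrigues(r):
    """Convert a rotation vector (axis * angle, in radians) to a 3x3 rotation matrix."""
    r = np.asarray(r, dtype=float)
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    k = r / theta
    # Skew-symmetric cross-product matrix of the unit axis k.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def project_gaze_to_screen(origin, gaze, r, t):
    """Intersect a gaze ray with the screen plane and return 2D screen coordinates.

    origin: 3D eye/face position in camera coordinates.
    gaze:   3D gaze direction in camera coordinates (normalized internally).
    r, t:   assumed screen pose (screen -> camera): p_cam = R(r) @ p_screen + t.
    """
    origin = np.asarray(origin, dtype=float)
    gaze = np.asarray(gaze, dtype=float)
    gaze = gaze / np.linalg.norm(gaze)
    t = np.asarray(t, dtype=float)

    R = rodrigues(r)
    normal = R[:, 2]  # screen z-axis expressed in camera coordinates

    denom = gaze @ normal
    if abs(denom) < 1e-9:
        raise ValueError("gaze ray is parallel to the screen plane")

    # Ray-plane intersection: origin + lam * gaze lies on the plane through t.
    lam = ((t - origin) @ normal) / denom
    hit_cam = origin + lam * gaze

    # Express the hit point in the screen frame and keep the in-plane (x, y) part.
    hit_screen = R.T @ (hit_cam - t)
    return hit_screen[:2]
```

For example, with an identity screen pose (`r = t = 0`), an eye at `(0, 0, 500)` and a gaze direction of `(0.1, 0, -1)`, the ray meets the screen plane roughly 50 units along the screen x-axis. Every step is a smooth function of `r` and `t`, which is what allows the screen pose to be learned from a few labeled 2D samples.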

Paper Structure

This paper contains 20 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: We introduce a novel cross-task few-shot 2D gaze estimation approach. Our method leverages a pre-trained 3D gaze estimation network and few-shot 2D gaze samples to achieve 2D gaze estimation on unseen devices. It contains a physics-based differentiable projection module to bridge 3D and 2D gaze, along with a dynamic pseudo-labelling strategy for 2D labels under unknown screen poses. Our approach is both screen-calibration-free and source-free, significantly expanding its application potential.
  • Figure 2: We propose a framework for cross-task few-shot 2D gaze estimation. The framework contains a physics-based differentiable projection module with learnable parameters $\mathbf{r}$ and $\mathbf{t}$ that model the screen pose and project 3D gaze into 2D gaze. The framework is fully differentiable and can integrate into existing 3D gaze networks without modifying their original architecture. Leveraging this framework, we can quickly adapt a 3D gaze model for 2D gaze estimation using only a small number of images.
  • Figure 3: The dynamic pseudo-labeling strategy for 2D gaze involves reversing the projection process to convert 2D gaze into 3D space, where we compute pseudo-labels. To align the camera coordinate system (CCS) with the unknown coordinate system (UCS), we use the same image sets as input to both the initial and the updated 3D model. The initial model is trained in the CCS, while the updated model operates within the UCS. By leveraging the outputs from these models as two anchors, we derive the transformation $\mathcal{T}$ to align the coordinate systems. Notably, $\mathcal{T}$ should be invertible.
  • Figure 4: We compare the performance across different pseudo-labelling strategies. The red bar represents the projection without pseudo-labelling, serving as a baseline for comparison. We also evaluated our method without the transformation $\mathcal{T}$. The resulting unreliable pseudo-labels lead to a significant performance drop on MPIIGaze and EVE. Interestingly, omitting $\mathcal{T}$ led to improved results on the GazeCapture dataset. We found that this was because the initial screen pose happened to be the same as the actual screen pose.
  • Figure 5: Performance with different numbers of training images.
  • ...and 2 more figures
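
The alignment described in the Figure 3 caption can be illustrated with a small sketch. The summary states only that $\mathcal{T}$ is computed via SVD and should be invertible; the code below assumes, as an illustration, that $\mathcal{T}$ is a pure rotation mapping CCS directions to UCS directions and estimates it with the standard orthogonal Procrustes (Kabsch) solution. The function names `estimate_alignment` and `flip_gaze_via_ccs` are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np


def estimate_alignment(gaze_ccs, gaze_ucs):
    """Estimate a rotation T such that T @ a_i is close to b_i (least squares).

    gaze_ccs: (N, 3) gaze directions from the initial model (CCS anchor).
    gaze_ucs: (N, 3) gaze directions from the updated model (UCS anchor),
              computed on the same images.
    Assumption: T is a proper rotation (Kabsch / orthogonal Procrustes).
    """
    A = np.asarray(gaze_ccs, dtype=float)
    B = np.asarray(gaze_ucs, dtype=float)

    # Cross-covariance matrix H = sum_i a_i b_i^T.
    H = A.T @ B
    U, _, Vt = np.linalg.svd(H)

    # Correct a possible reflection so that det(T) = +1.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T


def flip_gaze_via_ccs(gaze_ucs, T):
    """Build a pseudo-label for a horizontally flipped image.

    The gaze is moved from the UCS back to the CCS (T is orthogonal, so its
    inverse is its transpose), its x-component is negated there, and the
    result is mapped back to the UCS.
    """
    g_ccs = T.T @ np.asarray(gaze_ucs, dtype=float)
    g_ccs = g_ccs * np.array([-1.0, 1.0, 1.0])
    return T @ g_ccs
```

Because the estimate is a rotation, it is always invertible, which matches the requirement noted in the caption. Re-estimating it from the two models' outputs at each training stage is one way to read the "dynamic" part of the strategy, since the UCS drifts as the updated model and the screen pose parameters change.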