Rapidly deploying on-device eye tracking by distilling visual foundation models

Cheng Jiang, Jogendra Kundu, David Colmenares, Fengting Yang, Joseph Robinson, Yatong An, Ali Behrooz

Abstract

Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.
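To make the two-stage recipe above concrete, the snippet below is a minimal PyTorch-style sketch of the training losses it implies: a teacher adapted with synthetic supervision plus a consistency term on unlabeled real images, then a lightweight student trained with synthetic labels, teacher pseudo-labels, and an EMA copy of itself. This is an illustrative reconstruction, not the authors' implementation; the stand-in networks, loss weights, and EMA decay are assumptions.

```python
# Illustrative sketch of the two-stage DistillGaze recipe described in the abstract.
# Module definitions, loss weights, and the EMA decay are placeholder assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the real backbones: a large VFM-derived teacher (e.g. a ViT-B
# initialized from DINOv3) and a lightweight ~256K-parameter on-device student.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 512), nn.GELU(), nn.Linear(512, 2))
student = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 64), nn.GELU(), nn.Linear(64, 2))
ema_student = copy.deepcopy(student)          # EMA copy used for self-training
for p in ema_student.parameters():
    p.requires_grad_(False)

def ema_update(ema_model, model, decay=0.999):
    """Exponential moving average of student weights (decay value is an assumption)."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# ---- Stage 1: adapt the VFM teacher --------------------------------------
# Supervised gaze regression on labeled synthetic images, plus a self-supervised
# consistency term on two augmented views of the same unlabeled real frame.
def teacher_loss(syn_img, syn_gaze, real_view_a, real_view_b, w_consist=1.0):
    sup = F.l1_loss(teacher(syn_img), syn_gaze)
    consist = F.mse_loss(teacher(real_view_a), teacher(real_view_b))
    return sup + w_consist * consist

# ---- Stage 2: distill to the on-device student ----------------------------
# The student fits synthetic labels, matches the frozen teacher's predictions on
# unlabeled real images, and is regularized toward its own EMA copy.
def student_loss(syn_img, syn_gaze, real_img, w_teacher=1.0, w_ema=0.5):
    sup = F.l1_loss(student(syn_img), syn_gaze)
    with torch.no_grad():
        t_pseudo = teacher(real_img)
        e_pseudo = ema_student(real_img)
    distill = F.l1_loss(student(real_img), t_pseudo)
    self_train = F.l1_loss(student(real_img), e_pseudo)
    return sup + w_teacher * distill + w_ema * self_train

# Example stage-2 step; random tensors stand in for near-eye IR image batches.
syn, gaze = torch.randn(8, 1, 64, 64), torch.randn(8, 2)
real = torch.randn(8, 1, 64, 64)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss = student_loss(syn, gaze, real)
opt.zero_grad(); loss.backward(); opt.step()
ema_update(ema_student, student)
```

Only the student (and its EMA weights during training) is involved at deployment time; the teacher exists solely to provide pseudo-labels on unlabeled real imagery.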

Paper Structure

This paper contains 34 sections, 13 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: (a) VFM embeddings cluster by subject, obscuring gaze structure. t-SNE visualization of DINOv3 embeddings for near-eye images from 64 random subjects. Embeddings cluster globally by subject rather than gaze; however, smooth gradients within each cluster correlate with gaze direction, indicating gaze cues are present but dominated by identity. (b) Workflow overview. Linear probing DINOv3 with synthetic supervision (gray) significantly underperforms our on-device baseline (purple), despite $\sim$100$\times$ more parameters (see the probe sketch after this list). This motivates a two-stage approach: (i) optimize a ViT-B teacher from DINOv3; (ii) distill to an on-device student.
  • Figure 2: Overview of DistillGaze. (a) We optimize a VFM using synthetic supervision and self-distillation on unlabeled real data. (b) The optimized VFM is distilled to a lightweight student (256K parameters), with an EMA student providing complementary supervision. Only the student is deployed.
  • Figure 3: Population coverage across error percentiles. We plot the cumulative distribution of users at E50, E75, and E90. Our optimized VFM and DistillGaze student exhibit closely matched performance, demonstrating effective knowledge transfer. Both methods approach the fully supervised on-device upper bound (yellow).
  • Figure 4: Qualitative comparisons of gaze predictions. Pitch-yaw plots (degrees) compare ground truth (green) against five methods. The frozen DINOv3 linear probe (gray) yields large errors, confirming that VFM features do not directly transfer to gaze estimation. Our distilled on-device student (magenta) achieves accuracy comparable to the optimized VFM teacher (blue) while requiring 200$\times$ fewer parameters.
  • Figure 5: Ablations on VFM architecture. Gaze error (E50U50 $\downarrow$, $^\circ$) vs. parameter count for ConvNeXt-S and ViT-B backbones. Linear probing (gray) performs poorly despite high parameter counts. Synthetic fine-tuning and our optimization substantially reduce error. Distilled on-device students (left) approach optimized VFM accuracy with 100–200$\times$ fewer parameters.
  • ...and 5 more figures
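Figures 1(b) and 4 use a frozen-VFM linear probe as the reference baseline. For concreteness, the following is a minimal sketch of such a probe, assuming a frozen feature extractor `backbone` (in the paper, a DINOv3 backbone; here replaced by a stand-in module, since the exact loading call depends on the model release) and synthetic pitch-yaw labels. Only the linear head is trained.

```python
# Hedged sketch of a frozen-backbone linear probe for gaze regression.
# `backbone` is a stand-in for a frozen VFM that maps an image to one embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 768                                   # ViT-B embedding size
backbone = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, feat_dim))  # placeholder for a frozen VFM
for p in backbone.parameters():
    p.requires_grad_(False)

probe = nn.Linear(feat_dim, 2)                   # predicts (pitch, yaw) in degrees
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(images, gaze_deg):
    """One supervised step on labeled synthetic images; the backbone stays frozen."""
    with torch.no_grad():
        feats = backbone(images)
    loss = F.l1_loss(probe(feats), gaze_deg)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Dummy batch standing in for synthetic near-eye images with gaze labels.
imgs, labels = torch.randn(8, 1, 224, 224), torch.randn(8, 2)
print(probe_step(imgs, labels))
```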