Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

Chenkai Zhang

Abstract

Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and the ability to run in a browser. We therefore target a deployment-oriented operating point rather than the image-based, large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method that combines an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator that is differentiated through during episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from a brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point, not as a universal state-of-the-art claim against heavier appearance-based systems.
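As an illustration of the per-session calibration step described in the abstract, the following is a minimal sketch of a closed-form ridge head fitted on a short support set. It is not the paper's implementation: the function names, the embedding dimension, the regularization strength `lam`, and the choice to leave the bias term unpenalized are all assumptions made for the example, and the encoder that produces the embeddings is not shown.

```python
import numpy as np


def fit_ridge_head(Z, Y, lam=1e-2):
    """Fit a ridge head in closed form (hypothetical helper).

    Z: (n, d) encoder embeddings for the calibration samples.
    Y: (n, 2) known on-screen targets for those samples.
    Returns W of shape (d + 1, 2); the last row is the bias.
    """
    n, d = Z.shape
    Za = np.hstack([Z, np.ones((n, 1))])   # append a constant bias column
    reg = lam * np.eye(d + 1)
    reg[-1, -1] = 0.0                      # assumption: bias is not penalized
    # Solve (Za^T Za + reg) W = Za^T Y without forming an explicit inverse.
    return np.linalg.solve(Za.T @ Za + reg, Za.T @ Y)


def predict(W, Z):
    """Map query embeddings to point-of-regard estimates."""
    Za = np.hstack([Z, np.ones((Z.shape[0], 1))])
    return Za @ W


# Toy usage with synthetic embeddings standing in for encoder output.
rng = np.random.default_rng(0)
Z_support = rng.normal(size=(9, 3))        # e.g. a 9-point calibration
A_true = rng.normal(size=(3, 2))
b_true = np.array([0.5, -0.25])
Y_support = Z_support @ A_true + b_true

W = fit_ridge_head(Z_support, Y_support, lam=1e-8)
Z_query = rng.normal(size=(4, 3))
Y_hat = predict(W, Z_query)
```

Because the solution is a single linear solve, it can be recomputed cheaply for each new session, and in an autodiff framework the same expression can be differentiated with respect to the embeddings, which is the property that episodic meta-training relies on.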

Paper Structure (60 sections, 14 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Contributions.
Related work
Remote gaze estimation and webcam systems.
Landmark-based and landmark-guided gaze estimation.
Calibration and personalization.
Meta-learning with closed-form adaptation.
Equivariant representations for geometric inputs.
Smooth-pursuit paradigms.
Problem formulation
EMC-Gaze
Overview
Landmark graph construction
E(3)-equivariant landmark-graph encoder
Normalization.
...and 45 more sections

Figures (7)

Figure 1: EMC-Gaze overview. Landmark-only webcam gaze tracking is decomposed into a shared E(3)-equivariant landmark-graph encoder, a closed-form per-session ridge calibration head fitted on a short support set, and low-capacity deployment refinements. During meta-training, the encoder is optimized through the ridge solution together with canonicalization-consistency and pursuit-continuity regularization.
Figure 2: Accuracy. (Left) Distribution of overall angular RMSE across the 33 interactive evaluation runs. Each point is one run; boxes summarize the run distribution. (Right) Mean angular error per target location (degrees) for EMC-Gaze on the representative run.
Figure 3: Efficient calibration. Aggregate angular RMSE (degrees) versus number of calibration targets $k$ across the 33 interactive evaluation runs. Curves show mean performance; shaded bands show $\pm1$ standard deviation. The top axis reports approximate average calibration time for the EMC-Gaze prefix curve.
Figure 4: Robustness to head pose. Mean per-phase angular RMSE (degrees) across the 33 interactive evaluation runs under yaw/pitch/roll pose-hold blocks (still-head calibration support only). Error bars denote $\pm1$ standard deviation across runs.
Figure 5: Stability. CDF of within-target prediction jitter (degrees) on the representative run, computed as the distance from each prediction to that target's mean prediction. Lower is more stable.
...and 2 more figures
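The within-target jitter metric named in the Figure 5 caption (distance from each prediction to that target's mean prediction) can be sketched as follows. This assumes predictions are already expressed as 2D angular coordinates in degrees and that plain Euclidean distance in that space is an acceptable approximation; the function name and input layout are hypothetical.

```python
import math
from collections import defaultdict


def within_target_jitter(preds, target_ids):
    """Return, for each prediction, its distance to the mean prediction
    of the target it belongs to.

    preds: list of (x_deg, y_deg) predictions.
    target_ids: list of hashable target labels, one per prediction.
    """
    groups = defaultdict(list)
    for p, t in zip(preds, target_ids):
        groups[t].append(p)

    # Per-target mean prediction.
    means = {
        t: (sum(x for x, _ in ps) / len(ps), sum(y for _, y in ps) / len(ps))
        for t, ps in groups.items()
    }

    return [
        math.hypot(p[0] - means[t][0], p[1] - means[t][1])
        for p, t in zip(preds, target_ids)
    ]


# Toy usage: two predictions on target 0, one on target 1.
jitter = within_target_jitter([(0.0, 0.0), (2.0, 0.0), (5.0, 5.0)], [0, 0, 1])
```

Sorting the returned values and plotting their cumulative fraction gives a CDF of the kind the caption describes.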
