Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning

Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish

TL;DR

This work observes that modern eye-tracking headsets provide a continuous, training-free side channel that decomposes into two complementary axes: gaze fixation captures visual stability (quality), while pupil response captures arousal-linked moments (novelty). It operationalizes this insight as a Dual-Criterion Frame Curator that first gates frames by gaze quality and then ranks the survivors by pupil-derived novelty.

Abstract

Always-on egocentric cameras are increasingly used to record demonstrations for embodied robotics, imitation learning, and assistive AR, but the resulting video streams are dominated by redundant and low-quality frames. Under the storage and battery constraints of wearable devices, choosing which frames to keep is as important as how to learn from them. We observe that modern eye-tracking headsets provide a continuous, training-free side channel that decomposes into two complementary axes: gaze fixation captures visual stability (quality), while pupil response captures arousal-linked moments (novelty). We operationalize this insight as a Dual-Criterion Frame Curator that first gates frames by gaze quality and then ranks the survivors by pupil-derived novelty. On the Visual Experience Dataset (VEDB), frames curated at a 10% budget match the classification performance of the full stream, whereas naive signal fusion consistently destroys both contributions. The benefit is task-dependent: pupil ranking improves activity recognition, while gaze-only selection already dominates for scene recognition, confirming that the two signals serve genuinely different roles. Our method requires no model inference and operates at capture time, offering a path toward efficient, always-on egocentric data curation.
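
The gate-then-rank procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of the absolute pupil derivative as the novelty score, and the rounding of frame counts are all hypothetical choices. The 75% gate fraction and the 10% budget are taken from the Figure 1 caption and the abstract.

```python
import numpy as np

def pupil_novelty_from_diameter(pupil, fps):
    """Absolute pupil derivative |dp/dt| per second (assumed novelty score)."""
    pupil = np.asarray(pupil, dtype=float)
    # np.gradient uses unit sample spacing; scale by fps to get a per-second rate.
    return np.abs(np.gradient(pupil) * fps)

def dual_criterion_select(gaze_quality, pupil_novelty, budget=0.10, gate_keep=0.75):
    """Return sorted indices of frames kept by a gaze gate followed by pupil ranking.

    budget:    fraction of all frames to keep (hypothetical parameter name).
    gate_keep: fraction of frames that pass the gaze gate (top 75% by default).
    """
    gaze_quality = np.asarray(gaze_quality, dtype=float)
    pupil_novelty = np.asarray(pupil_novelty, dtype=float)
    n = gaze_quality.shape[0]
    n_keep = max(1, int(round(budget * n)))
    n_gate = max(n_keep, int(round(gate_keep * n)))

    # Stage 1: gate. Keep the frames with the highest gaze quality.
    gate_idx = np.argsort(-gaze_quality, kind="stable")[:n_gate]

    # Stage 2: rank. Among survivors, keep the frames with the highest novelty.
    order = np.argsort(-pupil_novelty[gate_idx], kind="stable")[:n_keep]
    return np.sort(gate_idx[order])

# Toy stream of 8 frames: frames 1 and 5 have the largest pupil response
# but poor gaze quality, so the gate removes them before ranking.
gaze = [0.9, 0.1, 0.8, 0.7, 0.95, 0.2, 0.85, 0.6]
pupil = [0.0, 9.0, 0.1, 0.5, 0.2, 8.0, 0.3, 0.4]
selected = dual_criterion_select(gaze, pupil, budget=0.25)
```

In this toy stream the gate discards the two high-novelty but unstable frames, which illustrates why the two stages are applied in sequence rather than fused into a single score. Both stages use only sorting over per-frame scalars, consistent with the claim that no model inference is needed.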

Paper Structure (62 sections, 4 equations, 5 figures, 16 tables)

Figures (5)

  • Figure 1: Quality-Novelty Decomposition. (a) Gaze confidence (x) captures stability; pupil response (y) captures novelty. Random includes junk; gaze-only yields clean but redundant frames; dual targets high stability and novelty. (b) Two-stage pipeline: gaze gate (top 75%) $\rightarrow$ pupil ranking within budget.
  • Figure 2: Correlation between physiological signals and DINOv2 feature change. (a) Pupil derivative $|dp/dt|$ is positively correlated with feature change at all lags (mean $\rho = +0.038$). (b) Gaze quality $g(t)$ is negatively correlated ($\rho = -0.037$), confirming it tracks stability. Error bars: $\pm 1$ s.d. across sessions.
  • Figure 3: Learning curves: activity (left), scene (right). Dual at 10% budget matches the performance achieved using all frames. Shaded: $\pm 1$ s.d. (10 seeds).
  • Figure 4: Task-Dependent Performance. (a) Activity: pupil ranking improves over gaze-only and random. (b) Scene: gaze-only dominates; pupil adds no benefit.
  • Figure 5: Qualitative Comparison. Top: dual-criterion (high gaze, high pupil). Bottom: random baseline, including blur and low-information content.
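
The lagged correlation analysis summarized in the Figure 2 caption could be reproduced along the following lines. This is a hedged sketch: the captions do not specify how feature change is computed or which correlation coefficient $\rho$ denotes, so the cosine distance between consecutive embeddings and the rank (Spearman-style) correlation used here are assumptions, and all function names are hypothetical.

```python
import numpy as np

def feature_change(features):
    """Cosine distance between consecutive frame embeddings (assumed definition)."""
    f = np.asarray(features, dtype=float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return 1.0 - np.sum(f[1:] * f[:-1], axis=1)

def rank_correlation(x, y):
    """Pearson correlation of ranks; ties are broken by position, not averaged."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def lagged_correlation(signal, change, lags):
    """Correlate signal[t] with change[t + lag] for each non-negative lag."""
    signal = np.asarray(signal, dtype=float)
    change = np.asarray(change, dtype=float)
    result = {}
    for lag in lags:
        n = min(len(signal), len(change) - lag)
        result[lag] = rank_correlation(signal[:n], change[lag:lag + n])
    return result

# Toy check: a signal that rises with the change series correlates positively
# at every lag; the same signal against a falling series correlates negatively.
rising = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
pos = lagged_correlation(rising, rising, lags=[0, 1, 2])
neg = lagged_correlation(rising, rising[::-1], lags=[0])
```

In practice `signal` would be the per-frame $|dp/dt|$ or $g(t)$ series and `change` the output of `feature_change` applied to DINOv2 embeddings, with one correlation computed per session before averaging.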