SkillSight: Efficient First-Person Skill Assessment with Gaze

Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman

TL;DR

SkillSight tackles automatic skill assessment from egocentric data by using gaze as a powerful, low-power cue. It introduces a two-stage framework: a teacher model (SkillSight-T) that fuses egocentric video $V$ and gaze $G$ across action–gaze interaction, attended-object sequences, and gaze dynamics to predict skill $S$, and a gaze-only student model (SkillSight-S) that distills the teacher’s knowledge so inference relies only on $G$. Across cooking, music, and sports, SkillSight-T achieves state-of-the-art performance, while SkillSight-S delivers competitive accuracy with a substantial power saving of up to $73\times$ relative to video-based methods, enabling practical, in-the-wild deployments on smart glasses. The results also reveal insightful gaze–skill relationships that align with psychology literature, supporting both practical AI-assisted learning and deeper cognitive understanding of expertise in real-world tasks.

Abstract

Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
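The two-stage idea above (train a video+gaze teacher, then distill into a gaze-only student) can be illustrated with a minimal training-objective sketch. This is not the authors' loss: it assumes, purely for illustration, that the student is trained with a cross-entropy term on the skill label plus a mean-squared-error term pulling the student's feature toward the teacher's feature. The names `student_loss` and `alpha`, and the four-way skill label, are hypothetical choices.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class index `label`."""
    return -math.log(softmax(logits)[label])

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def student_loss(student_feat, teacher_feat, skill_logits, skill_label, alpha=1.0):
    """Hypothetical student objective (an assumption, not the paper's loss):
    skill classification plus feature matching against the frozen teacher.
    `alpha` weights the distillation term."""
    return cross_entropy(skill_logits, skill_label) + alpha * mse(student_feat, teacher_feat)

# Toy example: the student feature comes from gaze only, while the teacher
# feature was computed offline from video and gaze.
loss = student_loss(
    student_feat=[0.2, 0.4, 0.1],
    teacher_feat=[0.3, 0.5, 0.0],
    skill_logits=[0.1, 1.2, 0.3, -0.5],  # assumed four skill levels
    skill_label=1,
)
print(f"student loss: {loss:.4f}")
```

Under this assumption the teacher is needed only during training; at inference the student consumes gaze alone, which is where the reported power saving comes from.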

Paper Structure

This paper contains 23 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Skill assessment with gaze. Experts and novices exhibit distinct attention behaviors, influencing both how they move their head and eyes and what they see, as illustrated here with clips from an expert (top) and novice (bottom) basketball layup from Ego-Exo4D [egoexo4d]. The proposed method explores the associations between gaze, action, and expertise to achieve accurate and power-efficient skill assessment, using either ego-video and gaze, or gaze alone. The blue ray indicates gaze direction and depth, while shading shows camera motion over past frames. Note: leftmost third-person timelapses and commentary text are for illustration only.
  • Figure 2: Left: Overview of SkillSight-Teacher. We incorporate three components that encode action and gaze correlation, attended object sequence, and gaze trajectory for skill assessment. These features are fused by the fusion layer for prediction. Right: Overview of the distillation method. SkillSight-Student learns to distill knowledge from the teacher feature $[e_v,e_c,e_g]$ using the distillation token $t_{dis}$. As guidance for evaluating skill in context, the student model performs subtask recognition with the action recognition token $t_{act}$.
  • Figure 3: What does an expert vs. novice tend to see more of? In these distributions, each patch crops the egocentric frame based on the subject's gaze coordinates. Our representation surfaces interesting patterns, like (left two boxes) how novice pianists fixate on their hands more often than experts do (77% vs. 45%, as quantified with hand detection), or (right two boxes) how bouldering experts exhibit greater gaze depth (1.4 m vs. 1.1 m) as they analyze moves further up the wall, resulting in smaller rocks in the crops. These patterns emerging from in-the-wild video are consistent with and even deepen prior findings from psychology [psy_eyespan].
  • Figure 4: Qualitative results. Both SkillSight-T and SkillSight-S better predict skill level than prior work. Experts and novices show distinct gaze patterns consistent with Ego-Exo4D [egoexo4d] expert commentaries, shown for reference but not used by any model. The last example (bottom right) shows a failure case, highlighting the challenge of assessing skill from subtle movements. Blue rays show gaze direction and depth, and frustum/ray shading indicates recent glasses motion. Ground-truth labels range from 1 (novice) to 4 (late expert).
  • Figure 5: Power–accuracy tradeoff. SkillSight-T outperforms all baselines, while SkillSight-S achieves the second-best accuracy and consumes the least energy.
  • ...and 2 more figures
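
The Figure 3 caption describes patches cropped from the egocentric frame around the subject's gaze coordinates. A minimal sketch of such a gaze-centered crop follows. It assumes a fixed square patch size, pixel-space gaze coordinates, and clamping at the frame border; the source specifies none of these. The function names `gaze_crop_box` and `crop` are hypothetical, and the frame is modeled as a plain nested list so the example needs no external libraries.

```python
def gaze_crop_box(gaze_xy, frame_size, patch_size):
    """Return (left, top, right, bottom) of a square patch centered on the
    gaze point, shifted as needed so it stays fully inside the frame."""
    gx, gy = gaze_xy
    width, height = frame_size
    if patch_size > width or patch_size > height:
        raise ValueError("patch must fit inside the frame")
    half = patch_size // 2
    # Clamp the top-left corner so the patch never leaves the frame.
    left = min(max(int(round(gx)) - half, 0), width - patch_size)
    top = min(max(int(round(gy)) - half, 0), height - patch_size)
    return left, top, left + patch_size, top + patch_size

def crop(frame, box):
    """Crop a frame stored as a list of rows (each row a list of pixels)."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]

# Toy 10x10 "frame" whose pixel value encodes its (row, col) position.
frame = [[r * 10 + c for c in range(10)] for r in range(10)]
box = gaze_crop_box(gaze_xy=(5, 5), frame_size=(10, 10), patch_size=4)
patch = crop(frame, box)
print(box, patch[0])
```

Clamping at the border is one design choice among several; padding the frame instead would keep the gaze point exactly centered at the cost of introducing synthetic pixels.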