SkillSight: Efficient First-Person Skill Assessment with Gaze
Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
TL;DR
SkillSight tackles automatic skill assessment from egocentric data by using gaze as an informative cue that is cheap to capture and process. It introduces a two-stage framework. A teacher model (SkillSight-T) fuses egocentric video $V$ and gaze $G$ by modeling action–gaze interaction, attended-object sequences, and gaze dynamics to predict skill $S$. A gaze-only student model (SkillSight-S) then distills the teacher's knowledge, so inference relies only on $G$. Across cooking, music, and sports, SkillSight-T achieves state-of-the-art performance, while SkillSight-S delivers competitive accuracy with a power saving of up to $73\times$ relative to video-based methods, making in-the-wild deployment on smart glasses practical. The results also reveal gaze–skill relationships that align with the psychology literature, supporting both AI-assisted learning and a deeper cognitive understanding of expertise in real-world tasks.
Abstract
Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using $73\times$ less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
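To make the teacher-to-student step concrete, here is a minimal sketch of a generic knowledge-distillation objective in plain Python. It is not the paper's loss, which this summary does not specify: the soft-label KL term, the `alpha` weight, the `temperature` value, and every function name are illustrative assumptions. The sketch only shows the general idea that a gaze-only student can be trained to match both the ground-truth skill label and the predictions of a video-plus-gaze teacher.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the true skill label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Hypothetical objective (an assumption, not the paper's):
    alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student).
    student_logits would come from a gaze-only model, teacher_logits
    from a frozen video+gaze model."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft

# Usage: three skill levels, true label 0.
agree = distillation_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], label=0)
disagree = distillation_loss([0.0, 0.0, 2.0], [2.0, 0.0, 0.0], label=0)
print(agree < disagree)
```

Under these assumptions, a student that matches the teacher pays only the label term, while a student that disagrees with both the teacher and the label is penalized by both terms. Only the student (gaze input) would run at inference time, which is where the power saving described above would come from.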
