Learning Skill-Attributes for Transferable Assessment in Video

Kumar Ashutosh, Kristen Grauman

TL;DR

CrossTrainer tackles the challenge of transferable video-based skill assessment by learning universal, fine-grained skill-attributes from video-language supervision and then using a multimodal language model to produce actionable feedback and proficiency estimates for unseen sports. The method uses a two-stage training regime: Stage I discovers attribute concepts from expert commentary to supervise a video-to-attribute mapper, and Stage II uses these attributes to generate feedback and estimate proficiency. Evaluated on Ego-Exo4D, QEVD, and YouTube in-the-wild data, CrossTrainer achieves up to 60% relative gains over baselines and shows graceful degradation in zero-shot transfer to novel sports. This work provides a scalable path toward cross-domain skill assessment and coaching by abstracting execution patterns into transferable skill-attributes that enrich multimodal reasoning. The approach holds potential to democratize expert coaching for long-tail sports and mixed-discipline activities through accessible video analysis and feedback generation.

Abstract

Skill assessment from video entails rating the quality of a person's physical performance and explaining what could be done better. Today's models specialize for an individual sport, and suffer from the high cost and scarcity of expert-level supervision across the long tail of sports. Towards closing that gap, we explore transferable video representations for skill assessment. Our CrossTrainer approach discovers skill-attributes (such as balance, control, and hand positioning) whose meaning transcends the boundaries of any given sport, then trains a multimodal language model to generate actionable feedback for a novel video, e.g., "lift hands more to generate more power", as well as its proficiency level, e.g., early expert. We validate the new model on multiple datasets for both cross-sport (transfer) and intra-sport (in-domain) settings, where it achieves gains up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques, enriching today's multimodal large language models.

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the idea. Given a short video of an athletic skill, what could be better? Given video demonstrations from multiple sports, we learn skill-attributes that are incorrectly demonstrated, e.g., wrong foot positioning in badminton (top). These skill-attributes are common across various sports, and transfer to novel uncommon sports, e.g., shinty (bottom). Our method improves both the in-domain and zero-shot settings. Sports chosen for illustration; see the experiments section for dataset details.
  • Figure 2: Discovered skill-attributes from Ego-Exo4D (left) and QEVD (right). We see phrases reflecting generalizable physical concepts like control, hand/body positioning, and movement.
  • Figure 3: Method overview and evaluation settings. (Left) We encode videos into tokens $\mathbf{v}$ that can be fed to a multimodal LLM $\mathcal{L}$, with a mapper $f_m$ that trains for skill-attributes. We use these visual tokens to generate skill-attributes (bottom left). Next, this pretraining is used to generate actionable feedback (bottom middle) and proficiency score (bottom right). (Right) Example of the various training settings for in-domain and zero-shot.
  • Figure 4: Zero-shot performance. Performance trend when testing on various skills (dribbling, penalty, etc.) for different in-domain and zero-shot training settings (FS, ZS-1, etc.) for skill-attribute generation (top) and actionable feedback generation (bottom) for Ego-Exo4D. The relative drop in performance w.r.t. FS is shown as a percentage. Our method is consistently the best among all methods, and its relative drop is the smallest for all zero-shot variants (the ideal curve would be flat and high). See Supp. for QEVD. Legend: CrossTrainer, ExpertAF, and InternVideo2.
  • Figure 5: Qualitative results. CrossTrainer generates skill-attributes, actionable feedback, and proficiency for samples from both Ego-Exo4D and QEVD. The outputs are meaningful even in the zero-shot setting (first two rows). Our method is also applied to in-the-wild videos from YouTube with novel sports (frisbee and water polo) and even new drills (juggling in soccer), with feedback matching the YouTube expert's comments (transcribed with ASR) (third row). Confusion matrix shows better transfer between related sports (bottom left). Failure cases here and in Supp. show the difficulty of the task, especially non-visual feedback like lacking intent (bottom right).
  • ...and 1 more figure
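
The Figure 4 caption reports a "relative drop in performance w.r.t. FS" as a percentage, and the abstract reports gains "relative to the state of the art". As a minimal sketch of how such percentages are conventionally computed, the snippet below uses two hypothetical helpers, `relative_drop` and `relative_gain`. The function names, the reading of FS as the in-domain reference score, and the example numbers are all assumptions made for illustration; none of them are taken from the paper.

```python
def relative_drop(reference_score: float, zero_shot_score: float) -> float:
    """Percentage by which a zero-shot score falls below a reference score.

    `reference_score` stands in for the FS setting in the caption
    (assumed here to be the in-domain reference).
    """
    if reference_score == 0:
        raise ValueError("reference_score must be non-zero")
    return 100.0 * (reference_score - zero_shot_score) / reference_score


def relative_gain(method_score: float, baseline_score: float) -> float:
    """Percentage by which a method's score exceeds a baseline's score."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return 100.0 * (method_score - baseline_score) / baseline_score


# Hypothetical scores on a 0-100 scale, for illustration only.
drop = relative_drop(reference_score=50.0, zero_shot_score=40.0)
gain = relative_gain(method_score=40.0, baseline_score=25.0)
print(f"relative drop: {drop:.1f}%")
print(f"relative gain: {gain:.1f}%")
```

Under this convention, a flat curve across the zero-shot variants corresponds to a relative drop near 0%, which is the "flat and high" ideal the caption describes.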