SportSkills: Physical Skill Learning from Sports Instructional Videos

Kumar Ashutosh, Chi Hsuan Wu, Kristen Grauman

Abstract

Current large-scale video datasets focus on general human activity, but lack the depth of coverage of fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions, spanning 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x over the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.

Paper Structure

This paper contains 21 sections, 2 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Our proposed SportSkills, sourced from YouTube [youtube], contains paired videos and instructional narrations describing the correct execution of a skill (top), totaling $638,399$ video clips. All the visuals are intended for skill improvement. We propose this dataset to enable video models to learn physical skills. SportSkills contains $55$ popular sports (bottom), ranging from judo and sprinting to tennis and handball.
  • Figure 2: SportSkills collection overview. We develop a curation pipeline to create a solid skill-learning video-language dataset sourced from in-the-wild coaching/how-to videos. We prompt an LLM [gpt4] to generate a list of physical sports and sport-specific skills. We use connecting words like 'tutorial' and 'drills' to create search terms (left); a minimal sketch of this query-construction step appears after the figure list. Next, we query YouTube [youtube] with these queries and obtain narrations. We use an LLM [gpt4] to filter out non-actionable instances, e.g., "click the link to get discounts..." (middle). Finally, we use a VLM [qwen25vl] to obtain pairs of $(v, t)$ with correct or incorrect demonstrations, filtering out cases without visual demonstrations (right).
  • Figure 3: Overview of our method to retrieve visual feedback given a suboptimal execution by a learner (top left). We finetune a VLM [qwen25vl], in a LoRA [lora] setting, to predict whether a given feedback candidate is relevant or not; a minimal sketch of the LoRA setup appears after the figure list. Top right: overview of the expert annotation task, where experts see the learner video, actionable feedback in text form, and the candidate YouTube video as visual feedback, and rate the relevance of the pair to create CoachGT. Bottom: our weakly supervised training data construction.
  • Figure 4: Average recall$@10$ for 20 sports. We show the performance of image- and video-based visual encoders with and without SportSkills as the training dataset (orange and blue shades, respectively); a sketch of the recall@$k$ computation appears after the figure list. We see a clear gain for the encoders trained with our proposed dataset. See Appendix for the performance on the remaining sports.
  • Figure 5: Retrieval-based skill corrections. We show representative outputs from our proposed model, which provides visual feedback on a suboptimal learner performance given only their video (left column). We do not input the expert commentary (second column) during either training or testing; it is shown here only for illustration. We see that the retrievals by our method provide personalized corrective feedback, whereas retrievals from the zero-shot retrieval baseline are not helpful for skill improvement. (Last row) We also include a few failure cases that showcase the difficulty of the task. Recall that the learner videos are sourced from [egoexo4d], and the feedback videos are from our proposed SportSkills.
  • ...and 5 more figures
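
As a rough illustration of the query-construction stage described in the Figure 2 caption, the sketch below combines an LLM-generated sport-to-skills mapping with connecting words such as 'tutorial' and 'drills' to form YouTube search terms. The `SPORT_SKILLS` dictionary, the `CONNECTORS` list, and the `build_search_queries` helper are hypothetical names for illustration only; the paper's actual prompts and query templates are not specified here.

```python
from itertools import product

# Hypothetical excerpt of the LLM-generated sport -> skills mapping (Figure 2, left).
SPORT_SKILLS = {
    "tennis": ["forehand", "one-handed backhand", "kick serve"],
    "judo": ["osoto gari", "uchi mata"],
}

# Connecting words mentioned in the caption; the full list is an assumption.
CONNECTORS = ["tutorial", "drills", "how to", "technique"]

def build_search_queries(sport_skills, connectors):
    """Form 'sport skill connector' search terms, e.g. 'tennis kick serve drills'."""
    queries = []
    for sport, skills in sport_skills.items():
        for skill, connector in product(skills, connectors):
            queries.append(f"{sport} {skill} {connector}")
    return queries

if __name__ == "__main__":
    for query in build_search_queries(SPORT_SKILLS, CONNECTORS)[:5]:
        print(query)
```

Each resulting string would then be issued as a YouTube search query, with the returned videos and narrations passed to the filtering stages shown in the middle and right of Figure 2.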
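The Figure 3 caption describes finetuning a VLM with LoRA adapters to decide whether a candidate feedback clip is relevant to a learner's suboptimal execution. Below is a minimal sketch of attaching LoRA adapters with the `peft` library; the base checkpoint name, target modules, and hyperparameters are assumptions rather than the paper's actual configuration, and the data loading and loss loop are omitted.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Assumed base checkpoint; the paper uses a Qwen2.5-VL model, exact variant unspecified here.
BASE_MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(BASE_MODEL)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16
)

# LoRA adapters on the attention projections; rank/alpha/target modules are illustrative guesses.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training would then present (learner clip, candidate feedback clip/narration) pairs and
# supervise the model to emit a relevant / not-relevant answer using the weak labels from
# Figure 3 (bottom); that loop is standard causal-LM finetuning and is not shown here.
```

Only the adapter weights are updated in this setting, which keeps finetuning lightweight relative to updating the full VLM.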
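Figure 4 reports average recall$@10$ for encoders trained with and without SportSkills. For reference, the sketch below computes recall@$k$ from a query-by-gallery similarity matrix; this is the standard retrieval metric, not the paper's evaluation code, and the variable names are illustrative.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, gt_index: np.ndarray, k: int = 10) -> float:
    """Fraction of queries whose ground-truth gallery item ranks in the top-k.

    similarity: (num_queries, num_gallery) similarity scores.
    gt_index:   (num_queries,) index of the correct gallery item for each query.
    """
    # Indices of the k highest-scoring gallery items per query.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == gt_index[:, None]).any(axis=1)
    return float(hits.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.standard_normal((100, 500))   # random scores, for demonstration only
    gt = rng.integers(0, 500, size=100)
    print(f"recall@10 = {recall_at_k(sim, gt, k=10):.3f}")
```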