Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study

Arushi Rai, Adriana Kovashka

TL;DR

This work tackles the challenge of generating actionable sports feedback from videos and addresses the generalization gap that arises when finetuning on a single sport. It introduces a cross-domain approach that combines abundant auxiliary data from the target domain (competition commentary and coaching texts) with limited source-domain feedback, using LLM-based refinement and precise localization to produce well-aligned, high-quality feedback. The authors also propose two evaluation metrics, specificity and actionability, grounded in motor learning theory, and validate them with human annotations. Empirically, incorporating auxiliary data yields strong improvements in out-of-distribution feedback generation and demonstrates the complementary value of text data for enhancing actionability. The approach offers a practical, scalable pathway to domain-adaptive sports feedback with interpretable evaluation metrics that go beyond traditional lexical baselines.

Abstract

While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose using auxiliary, freely available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint source domain to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.

Paper Structure (27 sections, 1 equation, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Our method leverages expensive, strongly-aligned annotated video-feedback pairs from a source domain (basketball, soccer) alongside abundant, freely available auxiliary data from the target domain (rock climbing). The auxiliary data includes weakly-aligned and unrelated video-text pairs from YouTube and coaching textbooks. Through refinement and precise localization, we transform weakly-aligned data into strongly-aligned training data, enabling effective cross-domain transfer in data-scarce settings.
  • Figure 2: Two-stage process to improve the quality and temporal localization of commentary. Top: The original, verbose ASR transcript is passed to an LLM prompted to summarize the commentary concisely and extract only information relevant to the action or its quality, or to skip the segment if no such information is present. Bottom: Our precise localization technique is applied to the refined commentary. For each refined commentary, the corresponding ASR timestamps are used to extract the audio, which is then passed to Whisper to obtain word-level timestamps relative to the ASR segment start timestamp. Finally, the refined commentary and the word-level timestamps are passed to another LLM prompted to localize where the refined commentary is narrated relative to the ASR segment start timestamp. The output is a list of commentary segments (as each refined commentary may contain multiple parts) with precise offset timestamps.
  • Figure 3: LLM-based evaluation of actionability (top) and specificity (bottom) as introduced in Sec. \ref{sec:eval_metrics}. The lines indicate the max/min across GPT-4o, Gemini 2.5, and DeepSeek Chat.
  • Figure 5: Window ablation. $t_{\text{start}}$ and $t_{\text{end}}$ are the start and end timestamps produced by the precise localization step. For "Windows around $t_{\text{start}}$", the mean and confidence interval are computed over the performance of the following windowing strategies when used for training: $(t_{\text{start}}, t_{\text{end}})$, $(t_{\text{start}}-3, t_{\text{start}}+1)$, $(t_{\text{start}}-4, t_{\text{start}})$, and $(t_{\text{start}}-4, t_{\text{end}})$. These windows experiment with different ways of including actions that may have occurred prior to the narration. Observe that the confidence interval is very small, indicating comparable performance when the window is slightly shifted earlier, but a consistent improvement over training without precise localization.
  • Figure : Looks up to anticipate the next moves. Timestamp (s): [28.0, 30.26]
  • ...and 2 more figures
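
The timestamp arithmetic described in the Figure 2 and Figure 5 captions can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `to_absolute` and `training_windows` are hypothetical, offsets are assumed to be in seconds, and clamping windows at 0 is an added assumption for narrations near the start of a video. The example values reuse the timestamp [28.0, 30.26] from the example caption above, on the assumption that it is a localized $(t_{\text{start}}, t_{\text{end}})$ pair.

```python
from typing import Dict, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


def to_absolute(segment_start: float, offset: Window) -> Window:
    """Shift an offset measured from the ASR segment start onto the video timeline.

    Hypothetical helper: the captions only state that the localized
    timestamps are relative to the ASR segment start timestamp.
    """
    return (segment_start + offset[0], segment_start + offset[1])


def training_windows(t_start: float, t_end: float) -> Dict[str, Window]:
    """Return the four windowing strategies listed in the Figure 5 caption."""

    def clamp(window: Window) -> Window:
        # Assumption (not from the source): never start before the video begins.
        return (max(0.0, window[0]), max(0.0, window[1]))

    return {
        "(t_start, t_end)": clamp((t_start, t_end)),
        "(t_start-3, t_start+1)": clamp((t_start - 3.0, t_start + 1.0)),
        "(t_start-4, t_start)": clamp((t_start - 4.0, t_start)),
        "(t_start-4, t_end)": clamp((t_start - 4.0, t_end)),
    }


if __name__ == "__main__":
    for name, window in training_windows(28.0, 30.26).items():
        print(name, window)
```

Three of the four strategies start before $t_{\text{start}}$, which matches the caption's stated goal of including actions that may have occurred before the narration.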

Theorems & Definitions (2)

  • Definition 1: Specificity
  • Definition 2: Actionability