ExpertAF: Expert Actionable Feedback from Video

Kumar Ashutosh; Tushar Nagarajan; Georgios Pavlakos; Kris Kitani; Kristen Grauman

TL;DR

ExpertAF generates actionable coaching feedback from video in two forms: verbal expert commentary and an expert demonstration. It builds a weakly-supervised training set by augmenting Ego-Exo4D with annotations derived from a large language model, and it learns a multimodal, token-based model that jointly handles video, pose, and text inputs to produce tailored feedback. Across soccer, basketball, and rock climbing, the method outperforms strong vision-language baselines on commentary generation, expert demonstration retrieval, and pose generation, both on automatic metrics and in human preference studies. Combining multimodal signals with weak supervision lets the model go beyond observation to concrete, coach-like guidance, a step toward accessible, personalized AI coaching.

Abstract

Feedback is essential for learning a new skill or improving one's current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback (AF) from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's [29] videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching (expert commentary, expert video retrieval, and expert pose generation), outperforming strong vision-language models on both established metrics and human preference studies.

Paper Structure (23 sections, 4 equations, 9 figures, 2 tables)

Introduction
Related work
Method
Problem statement
Forming the expert feedback dataset
Architecture and training design
Implementation details
Experiments and results
Conclusion
Supplementary video
Expert feedback dataset
Prompt for expert commentary classification and body region tagging
Expert feedback classification examples
Visualization of the dataset
Dataset statistics
...and 8 more sections

Figures (9)

Figure 1: An example of expert feedback. When a player is dribbling the ball fast, they tend to lose control (top left). Our proposed method provides an expert commentary to the learner suggesting improvements (bottom). The method also provides an expert demonstration that shows the desired correction, where the player is maintaining smaller steps and body control (top right).
Table 1: Results on automatic metrics (left) and human evaluation (right). We break down results for the three outputs: expert commentary generation, expert demonstration retrieval, and expert pose generation. Our method outperforms all baselines and prior work on all tasks. The last row "w/ full-sup" uses privileged input (the demo video $\bar{\mathcal{V}}$) at inference. (B@4: BLEU-4, M: METEOR, R-L: ROUGE-L F1, R: recall@50, medR: median rank, P: PA-MPJPE). For all metrics higher is better, except medR and PA-MPJPE ($\downarrow$). Our method is also rated higher by human raters on a Likert scale (min: 1, max: 4) than all the other methods (right). See text for details.
Figure 2: Overview of the dataset creation. We first summarize the human-provided expert commentary from Ego-Exo4D [29] in one sentence using an LLM, and then map it to a body region and a correct (green) or incorrect (red) execution label. We then choose incorrect-correct pairs for the same body region to obtain $\mathcal{C}$. Finally, we choose pairs with minimum temporal alignment loss to obtain the training data. Best in zoom.
Figure 3: Model overview. We tokenize individual modalities using a modality-specific architecture (top). Once all the modalities are encoded as tokens, we use a large language model to learn expert commentary generation, demonstration retrieval, and pose generation. At inference, the model only takes the learner demonstration video $\mathcal{V}$. See text for details.
Figure 4: Qualitative results. (Top) Comparison of expert commentary generated by various baselines. (Second and third rows) Examples of expert commentary generation, demonstration retrieval, and pose generation by our method. Notice that the expert demonstration and pose generation correct the mistake pointed out in the commentary, i.e., one hand is used to throw and the other to guide, and body position is improved for better control and power. Colored text and marks (red for mistake, green for correction) are shown only for visualization. (Bottom) Failure cases. See Supp. for video results.
...and 4 more figures
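
The results caption above names recall@50, median rank (medR), and PA-MPJPE without defining them. The following is a minimal sketch of their standard definitions, not the authors' evaluation code. All function names are hypothetical, and the convention that query i's ground-truth candidate sits at index i is an assumption made purely for illustration.

```python
import numpy as np


def ranks_from_similarity(sim):
    """1-indexed rank of the ground-truth candidate for each query.

    Assumption: sim[i, j] scores query i against candidate j, and the
    ground truth for query i is candidate i (the diagonal).
    """
    sim = np.asarray(sim, dtype=float)
    gt_scores = np.diag(sim)[:, None]
    # Rank = 1 + number of candidates scoring strictly higher than the ground truth.
    return 1 + np.sum(sim > gt_scores, axis=1)


def recall_at_k(ranks, k):
    """Fraction of queries whose ground truth is ranked within the top k."""
    return float(np.mean(np.asarray(ranks) <= k))


def median_rank(ranks):
    """Median ground-truth rank over all queries (lower is better)."""
    return float(np.median(np.asarray(ranks)))


def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error for (J, 3) joint arrays.

    The prediction is aligned to the ground truth with the best similarity
    transform (scale, rotation, translation) before the error is averaged.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_gt = gt.mean(axis=0)
    p = pred - pred.mean(axis=0)
    g = gt - mu_gt
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = -1.0 if np.linalg.det(vt.T @ u.T) < 0 else 1.0
    corr = np.diag([1.0, 1.0, d])
    rot = vt.T @ corr @ u.T
    scale = np.sum(s * np.diag(corr)) / np.sum(p ** 2)
    aligned = scale * p @ rot.T + mu_gt
    return float(np.mean(np.linalg.norm(aligned - gt, axis=1)))
```

Because PA-MPJPE removes global scale, rotation, and translation before measuring error, it isolates differences in body articulation, which is why it is reported with a lower-is-better arrow alongside medR.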
