ExpertAF: Expert Actionable Feedback from Video
Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman
TL;DR
ExpertAF generates actionable coaching feedback from video of a person performing a physical activity, producing both verbal expert commentary and an expert demonstration. It builds a weakly-supervised training set by augmenting Ego-Exo4D with large-language-model-driven annotations and learns a multimodal, token-based model that jointly handles video, pose, and text inputs to output tailored feedback. The method outperforms strong vision-language models across soccer, basketball, and rock climbing in commentary generation, expert demonstration retrieval, and pose generation, and its outputs are also preferred in human preference studies. By combining multimodal signals with weak supervision, this work moves toward accessible, personalized AI coaching that bridges the gap between observing a performance and giving concrete, coach-like guidance.
Abstract
Feedback is essential for learning a new skill or improving one's current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback (AF) from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's [29] videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multimodal input combinations to output full-spectrum, actionable coaching (expert commentary, expert video retrieval, and expert pose generation), outperforming strong vision-language models on both established metrics and human preference studies.
