CigTime: Corrective Instruction Generation Through Inverse Motion Editing
Qihang Fang, Chengcheng Tang, Bugra Tekin, Yanchao Yang
TL;DR
CigTime generates corrective instructional text from pairs of motions by treating the task as the inverse of motion editing. It builds a data-efficient pipeline that uses a pre-trained motion editor to produce (source motion, target motion, corrective text) triplets, tokenizes motions with a VQ-VAE, and fine-tunes a large language model to map the discrepancy between the two motions to an actionable instruction. In the reported evaluations, CigTime yields higher corrective instruction quality and reconstruction accuracy than the baselines, and its performance holds across different editors and datasets. The work targets language-grounded coaching and motor-skill learning, where precise, context-aware textual feedback helps users correct and improve dynamic movements.
Abstract
Recent advances in models that link natural language with human motion have shown significant promise for motion generation and editing based on instructional text. Motivated by applications in sports coaching and motor skill learning, we investigate the inverse problem: generating corrective instructional text by leveraging motion editing and generation models. We introduce a novel approach that, given a user's current motion (source) and the desired motion (target), generates text instructions that guide the user towards the target motion. We use large language models to generate corrective texts and existing motion generation and editing frameworks to compile datasets of triplets (source motion, target motion, and corrective text). Using this data, we propose a new motion-language model for generating corrective instructions. We present both qualitative and quantitative results across a diverse range of applications, showing substantial improvements over baselines. The results demonstrate the effectiveness of our approach in instructional scenarios, where it offers text-based guidance to correct and enhance user performance.
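To make the data flow concrete, the following is a minimal sketch of how one (source motion, target motion, corrective text) triplet might be turned into a training example for a language model. It is an illustration under stated assumptions, not the authors' implementation: a learned VQ-VAE is replaced here by a plain nearest-neighbour lookup into a fixed toy codebook, and the function names, the `<m_k>` token format, and the prompt template are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_motion(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame-level feature vector to the index of its nearest codebook entry.

    This mimics only the quantization step of a VQ-VAE; a real tokenizer
    would first pass the motion through a learned encoder (assumption).
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in features
    ]


def tokens_to_text(tokens: Sequence[int], prefix: str = "m") -> str:
    """Render discrete motion tokens as text tokens (hypothetical format)."""
    return " ".join(f"<{prefix}_{int(t)}>" for t in tokens)


def build_training_example(
    source_feats: Sequence[Vector],
    target_feats: Sequence[Vector],
    corrective_text: str,
    codebook: Sequence[Vector],
) -> Dict[str, str]:
    """Turn one triplet into a prompt/completion pair for fine-tuning.

    The prompt template is an assumption made for illustration only.
    """
    src = tokens_to_text(quantize_motion(source_feats, codebook))
    tgt = tokens_to_text(quantize_motion(target_feats, codebook))
    prompt = f"Source motion: {src}\nTarget motion: {tgt}\nCorrective instruction:"
    return {"prompt": prompt, "completion": " " + corrective_text}


if __name__ == "__main__":
    # Toy 2-D codebook and motions; real motion features are much higher-dimensional.
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    source = [[0.1, -0.1], [0.9, 1.2], [2.1, 1.8]]
    target = list(reversed(source))
    example = build_training_example(source, target, "Raise your arm.", toy_codebook)
    print(example["prompt"])
    print(example["completion"])
```

In this sketch the corrective text is simply passed in; in the approach described above it would be the instruction that a pre-trained motion editor used to produce the target motion from the source motion.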
