CigTime: Corrective Instruction Generation Through Inverse Motion Editing

Qihang Fang, Chengcheng Tang, Bugra Tekin, Yanchao Yang

TL;DR

CigTime generates corrective instructional text from a pair of motions by treating the task as the inverse of motion editing. It builds a data-efficient pipeline that uses a pre-trained motion editor to produce (source motion, target motion, corrective text) triplets, tokenizes motions with a VQ-VAE, and fine-tunes a large language model to map the discrepancy between the two motions to actionable instructions. In the reported evaluations, CigTime outperforms the baselines in both corrective instruction quality and reconstruction accuracy, and the results hold across different editors and datasets. The intended applications are language-grounded coaching and motor-skill learning, where textual feedback guides users to correct and improve their movements.
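
The tokenization step mentioned above can be illustrated with a minimal sketch of the quantization at the heart of a VQ-VAE: each frame feature is replaced by the index of its nearest codebook entry. This is an illustration under stated assumptions, not the paper's implementation. The learned encoder and decoder are omitted, and the function name `quantize`, the 2-D features, and the three-entry codebook are all hypothetical.

```python
from typing import List, Sequence


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame feature vector to the index of its nearest codebook entry."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Toy codebook and frame features (hypothetical values, for illustration only).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

tokens = quantize(frames, CODEBOOK)
print(tokens)  # [0, 1, 2]
```

The resulting integer sequence is the kind of discrete motion representation a language model can consume alongside ordinary text tokens.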

Abstract

Recent advancements in models linking natural language with human motions have shown significant promise in motion generation and editing based on instructional text. Motivated by applications in sports coaching and motor skill learning, we investigate the inverse problem: generating corrective instructional text, leveraging motion editing and generation models. We introduce a novel approach that, given a user's current motion (source) and the desired motion (target), generates text instructions to guide the user towards achieving the target motion. We leverage large language models to generate corrective texts and utilize existing motion generation and editing frameworks to compile datasets of triplets (source motion, target motion, and corrective text). Using this data, we propose a new motion-language model for generating corrective instructions. We present both qualitative and quantitative results across a diverse range of applications that largely improve upon baselines. Our approach demonstrates its effectiveness in instructional scenarios, offering text-based guidance to correct and enhance user performance.
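
The triplet compilation described in the abstract can be sketched as follows: each source motion is paired with a corrective text, and a motion editor produces the matching target motion. This is a rough sketch under assumptions. `Triplet`, `build_triplets`, and `toy_editor` are hypothetical names, and `toy_editor` is a deterministic stand-in for a real pre-trained motion editor, not an actual model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Tokens = Tuple[int, ...]


@dataclass(frozen=True)
class Triplet:
    source: Tokens     # the user's current motion, as tokens
    target: Tokens     # the desired motion, as tokens
    instruction: str   # corrective text that turns source into target


def build_triplets(
    sources: Sequence[Tokens],
    instructions: Sequence[str],
    editor: Callable[[Tokens, str], Tokens],
) -> List[Triplet]:
    """Apply the editor to every (source, instruction) pair to obtain targets."""
    return [
        Triplet(source=s, target=editor(s, text), instruction=text)
        for s in sources
        for text in instructions
    ]


def toy_editor(source: Tokens, instruction: str) -> Tokens:
    # Hypothetical stand-in: shifts token ids by an amount derived from the text.
    offset = len(instruction) % 4
    return tuple((tok + offset) % 8 for tok in source)


sources = [(0, 1, 2), (3, 3, 4)]
instructions = ["Raise your left arm.", "Bend your knees more."]
triplets = build_triplets(sources, instructions, toy_editor)
```

A model trained on such triplets learns the inverse mapping: given `source` and `target`, predict `instruction`.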

Paper Structure

This paper contains 32 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of CigTime. Left: We leverage source motion tokens and corrective instructions as input to a motion editor to produce target motion tokens. Right: We then employ a language model to generate precise corrective instructions based on a given source and target motion. The example shows corrective instructions generated for lifting weights with the upper body.
  • Figure 2: Template for LLM fine-tuning. The LLM is required to output the corrective instructions given token lists for the source and target motion sequences (i.e., Action 1 and Action 2) as well as instructions on the expected output.
  • Figure 3: Visualization of corrective instructions and reconstructed motions for different methods.
  • Figure 4: In-context learning for corrective instruction generation. The prompt given to the LLMs for in-context learning contains a task description and several examples, and instructs them to generate corrective instructions for new motion pairs.
  • Figure 5: Diversity of the corrective instructions. We present examples where the reconstructed motions look similar to the target motions but the corrective instructions still differ from the ground truth, demonstrating the robustness of our approach in generating effective and semantically meaningful corrective instructions.
  • ...and 3 more figures
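
The prompt layout described in the Figure 2 and Figure 4 captions (a task description, optional solved examples, then the source and target token lists labeled Action 1 and Action 2) might be assembled roughly as below. This is only a sketch: the wording of `TASK`, the `Instruction:` label, and the helper names are assumptions, and the paper's actual template is not reproduced here.

```python
from typing import List, Sequence, Tuple

# Hypothetical wording; the paper's real template text is not shown in this summary.
TASK = (
    "Given two motion token sequences, write a short instruction that tells "
    "the user how to change Action 1 into Action 2."
)

Example = Tuple[Sequence[int], Sequence[int], str]


def format_tokens(tokens: Sequence[int]) -> str:
    """Render a token list as a space-separated string."""
    return " ".join(str(t) for t in tokens)


def build_prompt(
    source: Sequence[int],
    target: Sequence[int],
    examples: Sequence[Example] = (),
) -> str:
    """Assemble the task description, optional solved examples, and the query pair."""
    parts: List[str] = [TASK]
    for ex_source, ex_target, ex_text in examples:
        parts.append(
            f"Action 1: {format_tokens(ex_source)}\n"
            f"Action 2: {format_tokens(ex_target)}\n"
            f"Instruction: {ex_text}"
        )
    # The query pair ends with an open "Instruction:" slot for the model to fill.
    parts.append(
        f"Action 1: {format_tokens(source)}\n"
        f"Action 2: {format_tokens(target)}\n"
        "Instruction:"
    )
    return "\n\n".join(parts)


prompt = build_prompt(
    [3, 3, 5],
    [3, 4, 6],
    examples=[([1, 2], [1, 7], "Raise your left arm higher.")],
)
print(prompt)
```

With an empty `examples` list the same function yields a fine-tuning style prompt, and with a few examples it yields an in-context learning prompt.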