Iterative Motion Editing with Natural Language

Purvi Goel, Kuan-Chieh Wang, C. Karen Liu, Kayvon Fatahalian

TL;DR

This work presents a natural-language-driven pipeline for iterative character motion editing that constrains edits to a small set of kinematic motion editing operators (MEOs). A large language model translates editing prompts and context into executable Python programs that assemble MEOs, which are then grounded in a source motion via keyframes and refined with a diffusion-based motion infilling model. The approach demonstrates high fidelity to edit intent, preserves the structure of the original motion, and yields realistic animations, outperforming state-of-the-art text-to-motion baselines in both qualitative and quantitative evaluations, including user studies. By enabling conversational, iterative refinement, the method offers a practical pathway for precise motion edits in professional animation workflows and potentially broadens accessibility to text-driven animation tools.

Abstract

Text-to-motion diffusion models can generate realistic animations from text prompts, but do not support fine-grained motion editing controls. In this paper, we present a method for using natural language to iteratively specify local edits to existing character animations, a task that is common in most computer animation workflows. Our key idea is to represent a space of motion edits using a set of kinematic motion editing operators (MEOs) whose effects on the source motion are well aligned with user expectations. We provide an algorithm that leverages pre-existing language models to translate textual descriptions of motion edits into source code for programs that define and execute sequences of MEOs on a source animation. We execute MEOs by first translating them into keyframe constraints, and then use diffusion-based motion models to generate output motions that respect these constraints. Through a user study and quantitative evaluation, we demonstrate that our system can perform motion edits that respect the animator's editing intent, remain faithful to the original animation (it edits the original animation, but does not dramatically change it), and yield realistic character animation results.
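
To make the "MEOs become keyframe constraints" step concrete, here is a toy sketch. It is not the paper's API: the `TranslateJoint` operator, the `to_keyframes` function, the joint names, and the pose representation (a dict from joint name to an xyz position) are all hypothetical and chosen only for illustration. A real system would edit full skeletal poses and hand the resulting keyframes to the diffusion-based infilling model, which is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslateJoint:
    """Hypothetical MEO: offset one joint's position at a single frame."""
    joint: str
    offset: tuple  # (dx, dy, dz), units assumed to match the source motion
    frame: int


def to_keyframes(source, meos):
    """Apply a sequence of MEOs and return {frame_index: edited_pose}.

    `source` is a list of poses, each a dict mapping joint name -> (x, y, z).
    The source motion itself is never modified; only the touched frames
    appear in the result, as keyframe constraints for a later infilling step.
    """
    keyframes = {}
    for meo in meos:
        # Start from an earlier edit of this frame if one exists.
        pose = dict(keyframes.get(meo.frame, source[meo.frame]))
        x, y, z = pose[meo.joint]
        dx, dy, dz = meo.offset
        pose[meo.joint] = (x + dx, y + dy, z + dz)
        keyframes[meo.frame] = pose
    return keyframes


# A three-frame toy motion and an edit that raises the right foot at frame 1.
source_motion = [
    {"hips": (0.0, 1.0, 0.0), "right_foot": (0.0, 0.0, 0.0)} for _ in range(3)
]
program = [TranslateJoint(joint="right_foot", offset=(0.0, 0.5, 0.0), frame=1)]
constraints = to_keyframes(source_motion, program)
```

Under these assumptions, `constraints` contains a single edited pose for frame 1, and every other frame is left for the infilling model to fill in around it.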

Paper Structure (27 sections, 3 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: System overview: Our system uses an LLM to translate a natural language editing instruction ($E$) into source code for a Python program that executes motion editing operations (MEOs). Our MEO execution engine applies MEOs to the source motion by first generating motion constraints (e.g., keyframes, retiming constraints). In the case shown above, $E$ describes a sub-movement that should start at the beginning of the motion and lead to a pose in the future; the engine determines the explicit frame requiring editing. A diffusion-based motion infilling step then produces output motions that embody the desired edit, preserve the original motion when possible, and look realistic. Our system can be used in an iterative fashion.
  • Figure 2: LLM Prompt Specification. An abridged LLM prompt that contains MEO API information, an editing prompt $E$: "Can you get that kick higher out?" (with context $E_{ctx}$ "A person is doing a side kick with the right leg"), and an example MEO program for the task: "lift the right knee to the chest during a jump.", which serves to teach the LLM agent how to use the API. In practice, we provide several examples. The example program here makes API calls to create a plan for completing the editing task, by using MEO construction methods from our API and lists of joints/directions. We ask the LLM agent to write a program that performs the motion edit by combining $E$ and $E_{ctx}$ into a function header comment. The LLM completes the code by writing an MEO program under the header comment.
  • Figure 3: Motion notation. $\mathbf{X}_\text{S}$ is the source motion; the condition $\mathbf{C}$ comprises $\mathbf{X}^{\text{ctx}}_{\text{S}}$ (context from $\mathbf{X}_\text{S}$) and the edited keyframe(s) $\mathbf{x}^{\text{key}}_{\text{E}}$. Our diffusion-based execution engine outputs $\mathbf{X}_\text{E}$. Gray squares represent components of $\mathbf{X}_\text{S}$; blue squares represent components or desired components of $\mathbf{X}_\text{E}$.
  • Figure 4: Infilling Diffusion Model. In training, our model (left) learns to infill motions. G takes two arguments: input, a noisy sequence imputed with $\mathbf{C}$, and cond, a masked version of $\mathbf{C}$. At inference (right), we optionally integrate $\mathbf{X}_{\text{spline}}$ to guide inference. For each $t$ we spatially lerp the infilled frames of $\mathbf{X}_{\text{spline}}$ with those progressively generated by G, using an interpolant $\lambda(t)$ that decreases monotonically as a function of $t$.
  • Figure 5: Handling natural-language instructions. Starting from a source motion (left column, in purple) and editing instruction (italicized), our system produces plausible motions (right column, blue) that preserve the structure of the original motion and abide by the editing instruction.
  • ...and 1 more figure
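
The spline-guided blending described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not say which sequence $\lambda(t)$ weights, what shape the schedule has, or in which direction $t$ is indexed. Here $\lambda$ is assumed to weight the spline frames, the schedule is assumed to be linear, and $t$ is assumed to count inference steps upward from 0, so the spline's influence fades as inference proceeds. The names `spline_weight`, `lerp_frame`, and `blend_infilled` are hypothetical, and motions are plain lists of per-frame feature lists.

```python
def spline_weight(t, num_steps):
    """Assumed linear schedule for lambda(t): 1 at t = 0, 0 at the last step."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return 1.0 - t / (num_steps - 1)


def lerp_frame(gen_frame, spline_frame, weight):
    """Linear interpolation of one frame; `weight` is the spline's share."""
    return [weight * s + (1.0 - weight) * g
            for g, s in zip(gen_frame, spline_frame)]


def blend_infilled(x_gen, x_spline, infill_mask, weight):
    """Blend only the frames marked as infilled; copy all others from x_gen."""
    return [
        lerp_frame(g, s, weight) if is_infilled else list(g)
        for g, s, is_infilled in zip(x_gen, x_spline, infill_mask)
    ]


# Toy data: 4 frames with 2 features each; frames 1 and 2 are infilled.
x_gen = [[0.0, 0.0] for _ in range(4)]
x_spline = [[1.0, 1.0] for _ in range(4)]
mask = [False, True, True, False]

halfway = blend_infilled(x_gen, x_spline, mask, spline_weight(2, 5))
final = blend_infilled(x_gen, x_spline, mask, spline_weight(4, 5))
```

In an actual sampler this blend would be applied once per denoising step to the model's current output; the frames outside the mask are the context and keyframes supplied through $\mathbf{C}$ and are left untouched here.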