MotionScript: Natural Language Descriptions for Expressive 3D Human Motions

Payam Jome Yazdian, Rachel Lagasse, Hamid Mohammadi, Eric Liu, Li Cheng, Angelica Lim

TL;DR

The paper addresses the challenge of generating fine-grained, expressive 3D human motions from natural language by bridging 3D motion data and LLM reasoning. It introduces MotionScript, a pipeline that converts 3D joint trajectories into structured, temporal captions through posecodes and motioncodes, which are extracted and aggregated automatically. Training a text-to-motion (T2M) model on MotionScript-augmented data yields more expressive motions and better out-of-distribution generation, as validated by large-scale human studies. The framework enables interpretable control, supports animation, robotics, and virtual humans, and provides a scalable path to leveraging LLMs for motion synthesis.

Abstract

We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement including expressive actions (e.g., emotions, stylistic walking) and interactions beyond standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data.

Paper Structure (18 sections, 3 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: MotionScript provides a structured language for open-vocabulary, fine-grained motion descriptions. By integrating LLMs with a text-to-motion model fine-tuned on MotionScript, this pipeline enables out-of-distribution motion generation in cases where standard text-to-motion models perform suboptimally.
  • Figure 2: The proposed MotionScript framework converts a sequence of 3D poses into a sequence of posecodes, detects and selects important motions, and finally aggregates them and converts them into text.
  • Figure 3: Example motion sequence, dynamic motion segmentation with detected MotionCodes (Algorithm 1), and the resulting MotionScript, a structured motion description.
  • Figure 4: Comparison of motion generation from T2M trained on human annotations (left) and T2M(MS) trained on MotionScript-augmented data (right), using plain text (top) and LLM-enhanced prompts (bottom).
  • Figure 5: Experiment 2 results comparing T2M(MS) (blue) and T2M(LLM) (red). We show mean ratings, on a scale from 1 (worst) to 7 (best), with standard deviation across 34 evaluation questions. T2M(MS) tends to outperform the baseline; significant differences are marked with a star (*).
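
The flow described in the Figure 2 caption (3D poses to posecodes, motion detection, aggregation, text) can be illustrated with a toy sketch for a single joint. This is not the authors' implementation: the angle bins, the `max_gap` merging rule, the 0.5-second speed threshold, and all function names are assumptions chosen only to make the idea concrete.

```python
import math

# Hypothetical angle bins in degrees for one "elbow angle" posecode.
# These thresholds are illustrative and are not taken from the paper.
ANGLE_BINS = [(0, 75, "completely bent"),
              (75, 135, "partially bent"),
              (135, 181, "straight")]


def joint_angle(a, b, c):
    """Return the angle at point b (degrees) formed by 3D points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def posecode(angle):
    """Map a continuous angle to a discrete category index."""
    for idx, (lo, hi, _label) in enumerate(ANGLE_BINS):
        if lo <= angle < hi:
            return idx
    raise ValueError(f"angle out of range: {angle}")


def detect_motioncodes(codes, max_gap=2):
    """Group category changes into motion segments.

    Changes in the same direction that are at most `max_gap` frames
    apart are merged into one segment (an assumed aggregation rule).
    """
    segments = []
    for t in range(1, len(codes)):
        step = codes[t] - codes[t - 1]
        if step == 0:
            continue
        direction = "extend" if step > 0 else "bend"
        last = segments[-1] if segments else None
        if last and last["direction"] == direction and (t - 1) - last["end"] <= max_gap:
            last["end"] = t
        else:
            segments.append({"direction": direction, "start": t - 1, "end": t})
    return segments


def describe(segments, fps, joint="left elbow"):
    """Turn detected segments into a short temporal caption."""
    parts = []
    for s in segments:
        duration = (s["end"] - s["start"]) / fps
        speed = "quickly" if duration < 0.5 else "slowly"  # assumed threshold
        verb = "extends" if s["direction"] == "extend" else "bends"
        parts.append(f"the {joint} {verb} {speed}")
    return ", then ".join(parts)


# Per-frame elbow angles for a short hypothetical clip sampled at 5 fps.
angles = [170, 160, 120, 90, 60, 60, 60, 100, 150, 170]
codes = [posecode(a) for a in angles]
caption = describe(detect_motioncodes(codes), fps=5)
print(caption)
```

The sketch covers a single angle posecode on one joint; a full system would track many joints and posecode types and would rank the detected motions before converting them to text.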