Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation

Mathis Petrovich; Or Litany; Umar Iqbal; Michael J. Black; Gül Varol; Xue Bin Peng; Davis Rempe

TL;DR

The paper addresses the lack of fine-grained, timeline-based control in text-driven 3D human motion synthesis by introducing a multi-track timeline interface and a test-time denoising method called Spatio-Temporal Motion Collage (STMC). STMC denoises a motion crop for each prompt and stitches the crops spatially (via body-part associations) and temporally (via DiffCollage) to produce coherent, complex animations from overlapping prompts. A SMPL-enabled diffusion variant (MDM-SMPL) speeds up sampling and produces SMPL outputs directly, while a new Multi-track Timeline (MTT) dataset supports evaluation. Across quantitative metrics (semantic alignment, realism, transitions) and perceptual studies, STMC improves over single-prompt, DiffCollage, and SINC baselines, offering a practical path toward animator-friendly text-to-motion systems.

Abstract

Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc.
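To make the input format concrete, the following is a minimal sketch of how a multi-track timeline could be represented. It is not taken from the released code: the names `Interval`, `Timeline`, `duration`, and `active_at`, the use of seconds as the time unit, and the example prompts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Interval:
    """One text prompt attached to a temporal interval (hypothetical structure)."""
    text: str
    start: float  # start time in seconds (assumed unit)
    end: float    # end time in seconds, exclusive

    def contains(self, time: float) -> bool:
        return self.start <= time < self.end


@dataclass
class Timeline:
    """A multi-track timeline: each track is a list of intervals."""
    tracks: List[List[Interval]] = field(default_factory=list)

    def duration(self) -> float:
        """Total length of the motion: the latest end time over all tracks."""
        return max((iv.end for track in self.tracks for iv in track), default=0.0)

    def active_at(self, time: float) -> List[str]:
        """All prompts active at a given time, across every track."""
        return [iv.text for track in self.tracks for iv in track if iv.contains(time)]


# Illustrative timeline: sequential actions on one track,
# an overlapping action on a second track.
timeline = Timeline(tracks=[
    [Interval("walking", 0.0, 4.0), Interval("jumping", 4.0, 6.0)],
    [Interval("raising both arms", 3.0, 5.0)],
])
print(timeline.duration())
print(timeline.active_at(3.5))
print(timeline.active_at(4.5))
```

Intervals on different tracks may overlap in time, which is what allows simultaneous actions, while intervals placed one after another on a single track express sequential actions.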

Paper Structure (46 sections, 4 equations, 7 figures, 2 tables)

Introduction
Related Work
Human motion synthesis
Motion composition
Controllable motion diffusion
Human Motion Synthesis from Timelines
Timeline Control Problem Formulation
Inputs
Outputs
Background: Motion Diffusion Models
STMC: Spatio-Temporal Motion Collage
Motion cropping and denoising
Spatial (body-part) stitching
Temporal stitching
SMPL Support for Motion Diffusion Model
...and 31 more sections

Figures (7)

Figure 1: Multi-track timeline control: We introduce a new problem setting for text-driven motion synthesis, where the input consists of parallel tracks allowing simultaneous actions, as well as continuous temporal intervals enabling sequential actions. A long and complex motion can be generated (top) given the structured input of multiple simple textual descriptions, each corresponding to a temporal interval (bottom).
Figure 2: Text-driven motion synthesis tasks: Our framework generalizes (a) traditional text-to-motion synthesis given one text and one duration, (b) temporal composition given a sequence of texts for non-overlapping intervals, and (c) spatial composition given a set of texts for a single interval. (d) Multi-track timeline control uses a set of texts for arbitrary intervals, allowing fine-grained control over the timings of several complex actions.
Figure 3: Overview of STMC: Before denoising, the multi-track timeline is first (a) partitioned into relevant body parts per text (using LLM-based labeling as in SINC, Athanasiou et al., 2023) to create body-part timelines, which are then (b) extended to overlap, leading to the transition intervals used for temporal stitching per body part with DiffCollage (Zhang et al., 2023). (c) At each denoising step, motions for each prompt are denoised independently before being combined based on the body-part timelines. The composite motion is re-noised by sampling $\bm{x}_{t-1}$ from $\mathcal{N}(\mu_t(\bm{x}_t, \hat{\bm{x}}_0), \bm{\Sigma}_t)$ (as in the paper's posterior equation) before being passed to the next step.
Figure 4: Perception study results: Our STMC method is preferred over baselines by human raters for both motion realism and semantic accuracy. (Left) Comparison against the strong SINC with Lerp baseline. (Right) Comparison against the DiffCollage baseline. MDM (Tevet et al., 2023) is used as the denoiser in these experiments.
Figure 5: Qualitative results: We visualize the results of STMC with MDM-SMPL on several input timelines and color the bodies depending on their location in the timeline. We see that STMC is capable of generating realistic motions, which capture the semantics of the given text prompts with the desired timing and duration. In (a) and (c), STMC generates motions that precisely follow the instructions, controlling a single arm while still performing another action. The accurate timing of intervals is demonstrated in (b) where the arms are still up in the air when transitioning from "walking" to "jumping", which is difficult to achieve with alternative methods. In (c) and (d), we observe that STMC is capable of generating compositions that were not present in the ground truth data, such as "walking backwards while eating" or "walking while playing violin".
...and 2 more figures
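
The denoising step described in the Figure 3 caption (denoise each prompt's crop independently, combine the predictions by body part, then re-noise) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `(text, start, end, feature_mask)` prompt layout, and the `denoiser` callable are hypothetical; the posterior uses the standard DDPM coefficients, which may differ from the paper's exact parameterization; and the overlapping transition intervals handled by DiffCollage are omitted.

```python
import numpy as np


def composite_x0(x_t, t, prompts, denoiser):
    """Denoise each prompt's crop independently and stitch by body part.

    x_t:      noisy motion, shape (n_frames, n_feats)
    prompts:  list of (text, start, end, feat_mask); feat_mask is a boolean
              array of shape (n_feats,) marking the features (body parts)
              that the prompt controls (assumed layout)
    denoiser: hypothetical callable (crop, t, text) -> clean-motion prediction
    """
    x0_hat = np.zeros_like(x_t)
    covered = np.zeros(x_t.shape, dtype=bool)
    for text, start, end, feat_mask in prompts:
        pred = denoiser(x_t[start:end], t, text)  # shape (end - start, n_feats)
        # Keep only the body parts assigned to this prompt; a later prompt
        # overwrites an earlier one where they claim the same entries.
        x0_hat[start:end, feat_mask] = pred[:, feat_mask]
        covered[start:end, feat_mask] = True
    if not covered.all():
        raise ValueError("every frame and body part needs an assigned prompt")
    return x0_hat


def posterior_sample(x_t, x0_hat, t, betas, rng):
    """Sample x_{t-1} ~ N(mu_t(x_t, x0_hat), Sigma_t) with DDPM coefficients."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    abar_prev = np.concatenate([[1.0], abar[:-1]])
    coef_x0 = np.sqrt(abar_prev[t]) * betas[t] / (1.0 - abar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev[t]) / (1.0 - abar[t])
    mean = coef_x0 * x0_hat + coef_xt * x_t
    var = betas[t] * (1.0 - abar_prev[t]) / (1.0 - abar[t])
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)


def stmc_like_step(x_t, t, prompts, denoiser, betas, rng):
    """One simplified denoising step: stitch predictions, then re-noise."""
    x0_hat = composite_x0(x_t, t, prompts, denoiser)
    return posterior_sample(x_t, x0_hat, t, betas, rng)
```

Because stitching operates on the predicted clean motion at each step rather than on the network weights, a loop over `t` that calls `stmc_like_step` could in principle wrap any pre-trained motion diffusion model that predicts a clean motion, which matches the abstract's description of a test-time method.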
