Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis

Seong-Eun Hong, JaeYoung Seon, JuYeong Hwang, JongHwan Shin, HyeongYeop Kang

TL;DR

Event-T2M introduces event-level conditioning for text-to-motion synthesis, addressing the failure of a single global embedding to capture multi-action temporal structure. By decomposing prompts into events via an LLM, encoding each event with a motion-specialized Text-to-Motion Retrieval (TMR) encoder, and fusing them through event-based cross-attention within a Conformer-based diffusion backbone, the model preserves event order and transitions. The authors validate their approach on standard benchmarks and a new HumanML3D-E dataset stratified by event count, showing competitive performance on simple prompts and clear gains as complexity increases, corroborated by two human studies. This work demonstrates that explicit event-level representations can generalize text-to-motion generation to complex, compositional prompts, enabling more reliable integration into animation pipelines and embodied agents.

Abstract

Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omissions, reordering, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event as the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each with a motion-aware retrieval model, and integrates them through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark stratified by event count. Experiments on HumanML3D, KIT-ML, and HumanML3D-E show that Event-T2M matches state-of-the-art baselines on standard tests while outperforming them as event complexity increases. Human studies validate the plausibility of our event definition, the reliability of HumanML3D-E, and the superiority of Event-T2M in generating multi-event motions that preserve order and naturalness close to ground-truth. These results establish event-level conditioning as a generalizable principle for advancing text-to-motion generation beyond single-action prompts.
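
The event-based cross-attention mentioned in the abstract can be illustrated with a minimal sketch in which every motion frame acts as a query over a set of per-event tokens. This is not the authors' implementation: the function name `event_cross_attention`, the toy 3-dimensional vectors, and the absence of learned projections, a global token, Conformer blocks, and the diffusion process are all simplifying assumptions. The event tokens are hard-coded stand-ins for what an LLM split followed by a TMR encoder would produce.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def event_cross_attention(frame_feats, event_tokens):
    """Scaled dot-product cross-attention: frames are queries, events are keys and values.

    Hypothetical simplification: no learned query/key/value projections.
    Returns (outputs, weights), one row per frame.
    """
    dim = len(event_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in frame_feats:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in event_tokens]
        w = softmax(scores)
        out = [sum(w[j] * event_tokens[j][c] for j in range(len(event_tokens)))
               for c in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy stand-ins for three encoded events, e.g. "walks forward", "turns around", "sits down".
event_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Toy frame features: early, middle, and late frames, each aligned with one event.
frame_feats = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

outputs, weights = event_cross_attention(frame_feats, event_tokens)
dominant = [max(range(len(w)), key=lambda j: w[j]) for w in weights]
print(dominant)  # index of the most-attended event for each frame
```

Because each frame keeps a separate attention distribution over the events, different parts of the motion can draw on different events, which a single pooled prompt embedding cannot express.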

Paper Structure (51 sections, 11 equations, 7 figures, 17 tables)

Figures (7)

  • Figure 1: Main Architecture of Event-T2M. An input prompt is split into clauses by an LLM, encoded as event tokens with a TMR encoder, and fused with a global token. Tokens guide the diffusion process through an event-level module, enabling generation of sequentially complex motions.
  • Figure 2: Overall comparison of Event-T2M: (a) As event counts increase ($\geq$1, $\geq$2, $\geq$3, $\geq$4), Event-T2M consistently achieves the lowest FID and the highest R-Precision, while baselines degrade sharply under compositional complexity. (b) Efficiency analysis at $\geq$4 events shows that Event-T2M achieves high accuracy with a small model size, demonstrating its compactness and scalability.
  • Figure 3: Results of the user study (7-point Likert scale). Error bars denote standard errors. (a) Fidelity, (b) Order alignment, and (c) Naturalness. Event-T2M achieves significant gains over all competing methods and performs on par with ground-truth (GT).
  • Figure 4: Qualitative comparison with a complex multi-event prompt. Event-T2M executes all events in order and with correct counts, while baselines often fail to generate them faithfully. See supplementary video for full motions.
  • Figure 5: (a) Number of samples in the HumanML3D test set and HumanML3D-E. (b) User study of prompts. Error bars denote standard errors. Asterisks denote statistical significance ($\ast\ast$: $p < 0.01$).
  • ...and 2 more figures
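
The event-count thresholds in the Figure 2 caption ($\geq$1 through $\geq$4) suggest cumulative subsets, which can be sketched as below. This is only an illustration under stated assumptions: the helper `stratify_by_event_count`, the sample dictionary layout, and the hand-written event lists are hypothetical, and in the paper the events come from an LLM-based decomposition rather than manual annotation.

```python
def stratify_by_event_count(samples, thresholds=(1, 2, 3, 4)):
    """Return {k: samples with at least k events} for each threshold k."""
    return {k: [s for s in samples if len(s["events"]) >= k] for k in thresholds}


# Hypothetical prompts with hand-written event lists (assumption, not dataset entries).
samples = [
    {"prompt": "a person walks forward",
     "events": ["walks forward"]},
    {"prompt": "a person walks forward then sits down",
     "events": ["walks forward", "sits down"]},
    {"prompt": "a person jumps, turns left, waves, and bows",
     "events": ["jumps", "turns left", "waves", "bows"]},
]

subsets = stratify_by_event_count(samples)
sizes = {k: len(v) for k, v in subsets.items()}
print(sizes)  # {1: 3, 2: 2, 3: 1, 4: 1}
```

Under this reading the subsets are nested, so each higher threshold keeps only the more compositional prompts and contains no more samples than the one before it.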