Temporal Consistency-Aware Text-to-Motion Generation

Hongsong Wang, Wenjing Yan, Qiuxia Lai, Xin Geng

TL;DR

This work proposes TCA-T2M, a framework for temporal consistency-aware text-to-motion (T2M) generation. It pairs a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE), which aligns temporal structure across sequences, with a masked motion transformer for text-conditioned motion generation.

Abstract

Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.

Paper Structure (13 sections, 12 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of temporal consistency across three distinct human action sequences. (a) A person walks forward; (b) A person walks and sits down; (c) A person sits down and stands up. Despite differences in kinematic details, these sequences exhibit shared temporal structures. Enforcing temporal alignment in the latent space ensures that the motion representation encoder $E$ maps corresponding action phases across sequences to similar representations. This constraint enables the learned motion representation to capture semantic information while preserving temporal consistency, which is essential for the subsequent text-conditioned motion generation stage in T2M.
  • Figure 2: Method overview. (a) Temporal consistency-aware spatial VQ-VAE employs hierarchical residual quantization to discretize motion features, incorporates cycle-consistency constraints to enforce temporal coherence, and utilizes a kinematic constraint block to refine motion details. (b) Masked motion transformer adopts a dual-transformer structure for cross-modal text-motion synthesis. Specifically, the motion transformer restores masked motion tokens under CLIP text guidance, while the residual transformer predicts successive residual tokens using textual context and preceding token sequences.
  • Figure 3: Qualitative comparisons between MDM (Tevet et al., 2022) and our method on representative motions from the HumanML3D dataset. Key frames highlight critical motion details. The visual comparisons underscore our method's strength in semantic comprehension of textual prompts and consistent action execution across multi-step sequences with dynamic environment adaptation.
  • Figure 4: Visualizations of long motion generation and zero-shot motion generation. (a) Long motion generation. We integrate three text prompts ("a person walks forward then turns right", "a person crawling from right to left", and "the person is walking in a counterclockwise circle") to generate a long-sequence motion. (b) Zero-shot generation. We separately test the text prompts "A person is climbing a ladder" and "A person is rolling" to generate zero-shot motion sequences.
  • Figure 5: Visualizations of failure cases of our approach. (a) A person walks up stairs, then turns right and walks back down the stairs. (b) A man walks backwards, then punches and kicks. (c) A man gets down on his hands and feet and crawls forward, then turns around and crawls back before standing up again.
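
The Figure 2 caption mentions hierarchical residual quantization as the step that discretizes motion features. The sketch below illustrates the general idea of residual quantization only: each layer picks the nearest code for whatever the previous layers left unexplained, so later layers refine earlier ones. It is not the authors' implementation. The function names (`nearest_code`, `residual_quantize`), the two-layer setup, and all codebook values are hypothetical and chosen for readability; a real model would learn the codebooks and operate on encoder outputs.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest_code(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, code))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    x: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Tuple[List[int], Vector]:
    """Quantize x with a stack of codebooks, one per residual layer.

    Returns the chosen index per layer and the summed reconstruction.
    """
    residual = list(x)          # what is still unexplained
    recon = [0.0] * len(x)      # running sum of selected codes
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, recon


# Hypothetical toy codebooks: a coarse layer followed by a fine layer.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
x = [1.2, 0.9]  # stand-in for one latent motion feature

indices, recon = residual_quantize(x, codebooks)
coarse_only_indices, coarse_only_recon = residual_quantize(x, codebooks[:1])


def sq_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 reconstruction error between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


print(indices, recon)
print(sq_error(x, coarse_only_recon), sq_error(x, recon))
```

In this toy setup the second layer lowers the reconstruction error relative to the coarse layer alone, which is the property that motivates predicting residual tokens layer by layer.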