TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation

Yiyang Cao, Yunze Deng, Ziyu Lin, Bin Feng, Xinggang Wang, Wenyu Liu, Dandan Zheng, Jingdong Chen

TL;DR

TriC-Motion tackles the lack of a unified tri-domain framework for text-to-motion generation by jointly modeling spatial, temporal, and frequency cues within a diffusion-based denoising architecture, augmented with causal intervention. It introduces Temporal Motion Encoding, Spatial Topology Modeling, Hybrid Frequency Analysis, Score-guided Tri-domain Fusion, and a Causality-based Counterfactual Motion Disentangler to suppress motion-irrelevant cues during training. The approach is trained with a combination of L_simple, L_fcf, and L_p losses, and evaluated on HumanML3D and SnapMoGen, where it achieves state-of-the-art R-Precision and improved FID/MM-Dist along with favorable perceptual quality in a user study. Overall, TriC-Motion demonstrates that integrating tri-domain information with causal interventions yields high-fidelity, coherent, diverse, and text-aligned motion, with practical implications for animation and embodied AI.

Abstract

Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion includes three core modeling modules for domain-specific modeling, namely Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After comprehensive modeling, a Score-guided Tri-domain Fusion module integrates valuable information from the triple domains, simultaneously ensuring temporal consistency, spatial topology, motion trends, and dynamics. Moreover, the Causality-based Counterfactual Motion Disentangler is meticulously designed to expose motion-irrelevant cues to eliminate noise, disentangling the real modeling contributions of each domain for superior generation. Extensive experimental results validate that TriC-Motion achieves superior performance compared to state-of-the-art methods, attaining an outstanding R@1 of 0.612 on the HumanML3D dataset. These results demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: https://caoyiyang1105.github.io/TriC-Motion/.
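The R@1 value quoted in the abstract is an R-Precision score. As a rough illustration of how such a retrieval metric is commonly computed on paired motion and text embeddings, here is a minimal sketch. It assumes matched pairs share a row index and that candidates are ranked by Euclidean distance. The function name `r_precision`, the array shapes, and the pool construction are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def r_precision(motion_emb, text_emb, top_k=3):
    """Top-1..top_k retrieval accuracy for paired embeddings.

    motion_emb, text_emb: arrays of shape (N, D); row i of each is a matched pair.
    Hypothetical helper for illustration; not the paper's evaluation script.
    """
    # dist[i, j] = Euclidean distance between motion i and text j
    diff = motion_emb[:, None, :] - text_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Rank of the matched text = number of candidate texts strictly closer
    matched = np.diag(dist)
    rank = (dist < matched[:, None]).sum(axis=1)
    # Fraction of motions whose matched text is within the top k, for k = 1..top_k
    return np.array([(rank < k).mean() for k in range(1, top_k + 1)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(32, 16))                    # pool of 32 descriptions
    motion = text + 0.01 * rng.normal(size=text.shape)  # nearly aligned motions
    print(r_precision(motion, text))
```

In this sketch the first entry of the returned array plays the role of R@1. Published benchmarks typically average the score over many small candidate pools and use embeddings from a pretrained evaluator, which this example leaves out.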

Paper Structure

This paper contains 26 sections, 11 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: (a) Visual comparison of motion generated before and after spatial modeling/frequency modeling/causal intervention; (b) Quantitative comparison of different methods’ performance on HumanML3D.
  • Figure 2: Structured Causal Model in TriC-Motion.
  • Figure 3: Overview of TriC-Motion. (a) Sampling process with stacked TriC-Motion Denoiser Blocks. (b) Overall architecture of the TriC-Motion framework.
  • Figure 4: Detailed architectures of TriC-Motion main components. (a) HFA with DWT/FFT decomposition; (b) Low-frequency branch network in HFA; (c) High-frequency branch network in HFA; (d) S-Fus with motion and semantic scoring; (e) Details of CCMD.
  • Figure 5: Qualitative comparisons on the HumanML3D dataset.
  • ...and 1 more figure
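
The caption of Figure 4(a) mentions a DWT/FFT decomposition inside Hybrid Frequency Analysis, with separate low-frequency and high-frequency branches. To make the idea of a frequency split concrete, the sketch below separates a motion sequence into a low-frequency trend and a high-frequency residual using a plain FFT mask along the time axis. The function name `split_frequency_bands`, the `cutoff_ratio` parameter, and the hard cutoff are assumptions chosen for illustration. The paper's actual HFA design is not specified in this summary and may differ substantially.

```python
import numpy as np


def split_frequency_bands(motion, cutoff_ratio=0.25):
    """Split a motion sequence into low- and high-frequency parts.

    motion: array of shape (T, D), T frames of D pose features.
    cutoff_ratio: fraction of rFFT bins kept in the low band (assumed value).
    Returns (low, high) such that low + high reconstructs motion.
    """
    num_frames = motion.shape[0]
    # Real FFT along the time axis: shape (T // 2 + 1, D)
    spectrum = np.fft.rfft(motion, axis=0)
    cutoff = max(1, int(cutoff_ratio * spectrum.shape[0]))
    # Keep only the lowest `cutoff` bins for the low-frequency band
    low_spectrum = np.zeros_like(spectrum)
    low_spectrum[:cutoff] = spectrum[:cutoff]
    low = np.fft.irfft(low_spectrum, n=num_frames, axis=0)
    # The residual carries the remaining (higher-frequency) content
    high = motion - low
    return low, high


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 64, endpoint=False)
    slow = np.sin(2 * np.pi * 1 * t)          # slow trend
    fast = 0.1 * np.sin(2 * np.pi * 20 * t)   # fast jitter
    motion = np.stack([slow + fast, slow - fast], axis=1)
    low, high = split_frequency_bands(motion)
    print(np.abs(low[:, 0] - slow).max(), np.abs(high[:, 0] - fast).max())
```

Defining the high band as the residual keeps the two parts exactly additive, which makes it easy to process each band separately and recombine them afterwards.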