Absolute Coordinates Make Motion Generation Easy

Zichong Meng, Zeyu Han, Xiaogang Peng, Yiming Xie, Huaizu Jiang

TL;DR

The paper revisits motion representation for text-to-motion diffusion and argues that absolute joint coordinates in global space, rather than local, kinematic-aware relative representations, yield higher fidelity and easier controllability. It introduces ACMDM, a Transformer-based diffusion model that operates on absolute coordinates and uses AdaLN conditioning with a velocity-based denoising objective; the model surpasses the prior state of the art. The approach naturally supports downstream tasks such as text-driven control, editing, and direct SMPL-H mesh vertex generation via a latent mesh autoencoder, demonstrated through extensive experiments on HumanML3D and KIT. Ablation studies support the absolute-coordinate formulation, and the method shows strong scalability and practical advantages over controllable motion baselines that rely on classifier guidance. Overall, this work lays a foundation for broader text-to-motion generation, including direct mesh-level synthesis, by removing the need for complex kinematic-aware representations.

Abstract

State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D, which encodes motion relative to the pelvis and to the previous frame with built-in redundancy. While this design simplifies training for earlier generation models, it introduces critical limitations for diffusion models and hinders applicability to downstream tasks. In this work, we revisit the motion representation and propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space. Through systematic analysis of design choices, we show that this formulation achieves significantly higher motion fidelity, improved text alignment, and strong scalability, even with a simple Transformer backbone and no auxiliary kinematic-aware losses. Moreover, our formulation naturally supports downstream tasks such as text-driven motion control and temporal/spatial editing without additional task-specific reengineering and costly classifier guidance generation from control signals. Finally, we demonstrate promising generalization to directly generate SMPL-H mesh vertices in motion from text, laying a strong foundation for future research and motion-related applications.
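To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of the two kinds of representation. It is an illustration under stated assumptions, not the HumanML3D feature layout or the authors' code: the function names, the Y-up axis convention, and the choice of joint 0 as the pelvis are all hypothetical, and the real local-relative format also carries rotations, velocities, and foot-contact features that are omitted here.

```python
import numpy as np

# Assumptions (not from the paper): joints are (T, J, 3) arrays in a Y-up
# world frame, and joint 0 is the pelvis/root.

def to_root_relative(abs_joints):
    """Convert absolute joints to a simplified local-relative form.

    Returns the per-frame root displacement on the ground plane (T, 2),
    the root height (T,), and the pelvis-relative joints (T, J-1, 3).
    """
    root = abs_joints[:, 0]                                # (T, 3)
    root_vel = np.zeros((len(root), 2))
    # Frame t stores the x/z displacement from frame t-1 (frame 0 stores 0).
    root_vel[1:] = root[1:, [0, 2]] - root[:-1, [0, 2]]
    root_height = root[:, 1]
    local = abs_joints[:, 1:] - root[:, None]              # relative to pelvis
    return root_vel, root_height, local

def to_absolute(root_vel, root_height, local, start_xz=(0.0, 0.0)):
    """Recover absolute joints by integrating the root displacements."""
    xz = np.cumsum(root_vel, axis=0) + np.asarray(start_xz)
    root = np.stack([xz[:, 0], root_height, xz[:, 1]], axis=-1)
    return np.concatenate([root[:, None], local + root[:, None]], axis=1)
```

In this sketch, the global position of any joint at frame t depends on the sum of all root displacements up to t, whereas in the absolute form it is simply `abs_joints[t, j]`. That is one way to see why a constraint such as "place this joint at a given world position" is direct to express in absolute coordinates.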

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Absolute coordinates make motion generation easy. Here we show that our model produces motion of higher fidelity, has better controllability, and reports promising results of generating SMPL-H meshes directly.
  • Figure 2: Overview of our proposed ACMDM. (a) Left: The raw/latent absolute coordinates representation is patchified and processed through a sequence of ACMDM blocks. Right: Details of ACMDM blocks, where we experiment with two conditioning variants: concatenation and AdaLN. (b) ControlNet-augmented ACMDM for controllable motion generation: Structured control signals are separately encoded and fused into the ACMDM generation process via additive residuals at each ACMDM block, enabling the model to follow both semantic and spatial control constraints.
  • Figure 3: Visual comparisons of generated motion between ACMDM and state-of-the-art methods. ACMDM generates more realistic motion that accurately follows the textual condition.
  • Figure 4: Scaling of ACMDM with model capacity and decreasing patch size. We use red for S, orange for B, green for L, and blue for XL, with color gradients indicating decreasing patch sizes. ACMDM exhibits strong scalability, with performance consistently improving as model size increases and patch size decreases.
  • Figure A1: Model and patch size scaling results of ACMDM. Top row: FID and R-Precision Top 1 are compared while holding patch size constant. Bottom row: Results are shown while holding model size constant. Our model exhibits strong scalability with increasing model capacity and decreasing patch size.
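The Figure 2 and Figure 4 captions mention patchification and AdaLN conditioning without giving details, so the sketch below only illustrates the general ideas. Everything in it is an assumption: patching along the time axis only, the function names, and the AdaLN form (scale and shift regressed linearly from a conditioning vector, as commonly done in diffusion Transformers). It is not the ACMDM implementation.

```python
import numpy as np

def patchify(motion, patch_frames):
    """Group consecutive frames into tokens (hypothetical time-only patching).

    motion: (T, J, 3) absolute joint coordinates.
    Returns tokens of shape (T // patch_frames, patch_frames * J * 3).
    """
    T, J, C = motion.shape
    assert T % patch_frames == 0, "T must be divisible by patch_frames"
    return motion.reshape(T // patch_frames, patch_frames * J * C)

def layer_norm(x, eps=1e-6):
    """Layer normalization over the feature axis, without learned affine terms."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(tokens, cond, w_shift, w_scale):
    """Adaptive layer norm: shift and scale are predicted from the condition.

    tokens: (N, D); cond: (E,); w_shift, w_scale: (E, D) projection matrices.
    """
    shift = cond @ w_shift          # (D,)
    scale = cond @ w_scale          # (D,)
    return layer_norm(tokens) * (1.0 + scale) + shift
```

Under these assumptions, halving `patch_frames` doubles the number of tokens the Transformer attends over, which is the trade-off behind the patch-size axis in the scaling figures: finer temporal resolution per token at higher compute cost.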