Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu

Abstract

Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.
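
To put the reported HumanML3D numbers in proportion, the following minimal sketch computes the relative reduction implied by each before/after pair quoted in the abstract. The helper name `relative_reduction` and the dictionary labels are illustrative and not taken from the paper; only the six numeric values come from the text above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a lower-is-better metric dropped from `before` to `after`."""
    return (before - after) / before


# Value pairs quoted in the abstract; the labels are our own shorthand.
reported = {
    "trajectory error (cm), MaskControl vs. ours": (0.72, 0.08),
    "FID, MaskControl vs. ours": (0.083, 0.029),
    "FID, ours under stronger kinematic constraints": (0.033, 0.014),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_reduction(before, after):.1%} lower")
```

Under these figures, trajectory error falls by roughly 89%, FID relative to MaskControl by roughly 65%, and FID under stronger constraints by roughly 58%.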

Paper Structure (57 sections, 30 equations, 3 figures, 8 tables)


Figures (3)

  • Figure 1: (a) A unified Perception–Planning–Control pipeline for conditional motion generation. (b) MoTok enables compact motion tokenization with fewer tokens while maintaining competitive performance. (c) Bridging semantics and kinematics by applying coarse constraints in planning and fine constraints in motion control. MoTok maintains and improves fidelity as more joints are controlled, rather than compromising between controllability and realism. (d) Left-wrist XY trajectories with sparse control (red triangles). MoTok yields the most natural trajectory and best alignment.
  • Figure 2: Overview of MoTok and the unified motion generation framework. (a) MoTok factorizes motion representation into compact discrete tokens and diffusion-based reconstruction by decoding tokens into per-frame conditioning for conditional diffusion. (b) A unified conditional generation framework built on MoTok supports both discrete diffusion and autoregressive planners, integrating global and local conditions in a generator-agnostic manner.
  • Figure 3: Visual comparison with state-of-the-art methods for any-joint any-frame control. The right panels show trajectory views of two of the controlled joints. Red indicates the control signal and blue indicates the generated motion.