DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu

TL;DR

DisCoRD addresses the discord between discrete and continuous human motion representations by decoding pretrained discrete motion tokens in a continuous space using a rectified flow decoder. It introduces Condition Projection to map tokens to frame-wise conditioning and trains on sliding motion windows, enabling smoother, more expressive motion while preserving conditioning faithfulness. A novel evaluation metric, symmetric Jerk Percentage Error (sJPE), explicitly captures frame-wise noise and under-reconstruction, aligning better with human perception of naturalness. Across text-to-motion, co-speech gesture, and music-to-dance tasks, DisCoRD achieves state-of-the-art naturalness (lower FID) while maintaining faithfulness, and is compatible with any discrete-token framework, offering a versatile, scalable solution for bridging discrete efficiency and continuous realism.

Abstract

Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this 'discord' between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals across diverse settings. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with an FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Project website: https://whwjdqls.github.io/discord-motion/
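
To make the idea of rectified flow decoding concrete, the following is a minimal sketch of the generic rectified flow recipe, not the authors' implementation. It assumes the standard formulation: a noisy sample is a straight-line interpolation $\mathbf{X}_t = (1-t)\mathbf{X}_0 + t\mathbf{X}_1$ between Gaussian noise $\mathbf{X}_0$ and motion $\mathbf{X}_1$, the vector field regresses the constant velocity $\mathbf{X}_1 - \mathbf{X}_0$, and decoding integrates the field with Euler steps. All function names are hypothetical, and `toy_field` is an analytic stand-in for a trained network in which the target motion equals the condition itself. In DisCoRD the condition would instead be the frame-wise features projected from discrete tokens.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to motion x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the vector field: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(v, x0, x1, cond, t):
    """Mean squared error between the predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    pred = v(x_t, t, cond)
    target = velocity_target(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_decode(v, cond, num_steps=16, rng=None):
    """Start from Gaussian noise and integrate the vector field from t=0 to t=1."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1
        vel = v(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, vel)]
    return x


def toy_field(x_t, t, cond):
    """Hypothetical stand-in for a trained network.

    Assumes the target motion equals cond, for which the exact
    rectified flow velocity is (cond - x_t) / (1 - t).
    """
    return [(c - x) / (1.0 - t) for x, c in zip(x_t, cond)]


if __name__ == "__main__":
    cond = [0.5, -1.0, 2.0]  # made-up per-frame condition values
    print(euler_decode(toy_field, cond))
```

With the toy field, the decoded sample lands on the condition regardless of the starting noise. A learned field would replace `toy_field`, and the number of Euler steps trades decoding speed against accuracy.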

Paper Structure

This paper contains 43 sections, 8 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Concept of DisCoRD. Discrete quantization methods encode multiple motions into a single quantized representation. While existing methods directly decode from this quantized representation, DisCoRD iteratively decodes the discrete latent in a continuous space to recover the inherent continuity and dynamism of motion. To assess the gap between reconstructed and real motion, prior work primarily used FID as the metric. Here, we additionally propose symmetric Jerk Percentage Error (sJPE) to evaluate the differences in naturalness between reconstructed and real motion.
  • Figure 2: An overview of DisCoRD. During the Training stage, we leverage a pretrained quantizer to first obtain discrete representations (tokens) of motion. These tokens are then projected into continuous features $\mathbf{C}$, which are concatenated with noisy motion $\mathbf{X}_t$. This concatenated feature is used to train a vector field $v$. During the Inference stage, we use a pretrained token prediction model based on the pretrained quantizer to first generate tokens from the given control signal. These generated tokens are then projected into continuous features $\mathbf{\hat{C}}$, concatenated with Gaussian noise $\mathbf{X}_0\sim \mathcal{N}(0,I)$, and iteratively decoded through the learned vector field $v_\theta$ into motion $\mathbf{\hat{X}}_1$.
  • Figure 3: sJPE and FID response to frame-wise Gaussian noise. We introduce Gaussian noise with varying standard deviations (x-axis) to ground-truth motion data and evaluate its effect on sJPE and FID. Noise sJPE is highly sensitive to subtle frame-wise perturbations, whereas Static sJPE remains low. FID is largely insensitive to frame-wise noise. Note that the FID scale (y-axis, right) is very small compared to the sJPE scale (y-axis, left).
  • Figure 4: Under-reconstruction and frame-wise noise. We visualize fine-grained motion trajectories (top), and corresponding jerk graphs (bottom), where blue and red regions indicate noise and static sJPE, respectively. Compared to other methods, DisCoRD significantly reduces sJPE, resulting in smoother motion (fewer blue regions) and greater dynamism (fewer red regions), as highlighted in green boxes.
  • Figure 5: Decoding efficiency comparison. We report the average decoding time for a batch of 32 token sequences on an NVIDIA RTX 4090 Ti, averaged over 20 trials on the HumanML3D test set. DisCoRD achieves better motion naturalness at a decoding speed comparable to MoMask, and can even decode significantly faster while maintaining superior performance.
  • ...and 12 more figures
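
The captions for Figures 3 and 4 rely on jerk, the third time derivative of position, to separate frame-wise noise from under-reconstruction. The sketch below illustrates why a jerk-based score reacts strongly to small per-frame perturbations. The exact sJPE definition is not given in this section, so the formula here is an assumption modeled on a symmetric percentage error: per-frame terms $|J_{pred} - J_{gt}| / (|J_{pred}| + |J_{gt}|)$ are counted as "noise" when the predicted jerk is larger and as "static" otherwise. The function names, the `fps` default, and the `eps` constant are hypothetical.

```python
import math
import random


def jerk(positions, fps=20.0):
    """Third finite difference of a 1-D trajectory, scaled to units per second^3."""
    dt = 1.0 / fps
    return [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i])
        / dt ** 3
        for i in range(len(positions) - 3)
    ]


def symmetric_jerk_error(pred, gt, fps=20.0, eps=1e-8):
    """Assumed sJPE-style score, returned as (noise, static).

    'noise' accumulates frames where the prediction has more jerk than the
    ground truth; 'static' accumulates frames where it has less or equal.
    """
    noise = static = 0.0
    jerk_gt = jerk(gt, fps)
    for p, g in zip(jerk(pred, fps), jerk_gt):
        p, g = abs(p), abs(g)
        term = abs(p - g) / (p + g + eps)
        if p > g:
            noise += term
        else:
            static += term
    n = len(jerk_gt)
    return noise / n, static / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Slow sinusoid as a made-up ground-truth joint trajectory (0.25 Hz at 20 fps).
    gt = [math.sin(2 * math.pi * 0.25 * i / 20.0) for i in range(120)]
    noisy = [x + rng.gauss(0.0, 0.01) for x in gt]  # small frame-wise noise
    frozen = [0.0] * len(gt)                        # under-reconstructed motion
    print("noisy :", symmetric_jerk_error(noisy, gt))
    print("frozen:", symmetric_jerk_error(frozen, gt))
```

In this toy setting, a perturbation that is small relative to the motion amplitude pushes almost all of the error into the noise component, while a frozen prediction pushes it into the static component, which mirrors the behavior the captions describe.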