DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
TL;DR
DisCoRD addresses the discord between discrete and continuous human motion representations by decoding pretrained discrete motion tokens in a continuous space using a rectified flow decoder. It introduces Condition Projection to map tokens to frame-wise conditioning and trains on sliding motion windows, enabling smoother, more expressive motion while preserving conditioning faithfulness. A novel evaluation metric, symmetric Jerk Percentage Error (sJPE), explicitly captures frame-wise noise and under-reconstruction, aligning better with human perception of naturalness. Across text-to-motion, co-speech gesture, and music-to-dance tasks, DisCoRD achieves state-of-the-art naturalness (lower FID) while maintaining faithfulness, and is compatible with any discrete-token framework, offering a versatile, scalable solution for bridging discrete efficiency and continuous realism.
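To make the idea behind a jerk-based metric concrete, the sketch below computes a symmetric percentage error on jerk (the third time derivative of position) for a single 1-D joint trajectory. This is an illustration only, not the paper's definition of sJPE: the SMAPE-style normalization, the split into a "noise" term and an "under-reconstruction" term, and the names `jerk` and `symmetric_jerk_error` are all assumptions made here.

```python
def jerk(traj, fps=1.0):
    """Third-order finite difference of a 1-D trajectory (list of floats)."""
    if len(traj) < 4:
        raise ValueError("need at least 4 frames to estimate jerk")
    dt3 = (1.0 / fps) ** 3
    return [
        (traj[t + 3] - 3 * traj[t + 2] + 3 * traj[t + 1] - traj[t]) / dt3
        for t in range(len(traj) - 3)
    ]


def symmetric_jerk_error(pred, gt, eps=1e-8):
    """Assumed SMAPE-style error on jerk magnitude (not the paper's formula).

    Returns (noise, under):
      noise -- average share by which predicted jerk exceeds ground-truth jerk
      under -- average share by which predicted jerk falls short of it
    """
    jp = [abs(j) for j in jerk(pred)]
    jg = [abs(j) for j in jerk(gt)]
    noise = 0.0
    under = 0.0
    for p, g in zip(jp, jg):
        denom = p + g + eps  # symmetric normalization, bounded in [0, 1]
        noise += max(p - g, 0.0) / denom
        under += max(g - p, 0.0) / denom
    n = len(jp)
    return noise / n, under / n


# A smooth ground truth versus a prediction with frame-wise jitter.
gt = [float(t * t) for t in range(10)]
pred = [g + (0.1 if t % 2 == 0 else -0.1) for t, g in enumerate(gt)]
noise, under = symmetric_jerk_error(pred, gt)
```

Under these assumptions, a jittery prediction of a smooth motion raises the noise term, while an over-smoothed prediction of a dynamic motion raises the under-reconstruction term.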
Abstract
Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this 'discord' between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals across diverse settings. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with an FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results establish DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Project website: https://whwjdqls.github.io/discord-motion/
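The idea of framing token decoding as conditional generation with rectified flow can be sketched as follows. This is a toy, assumption-laden illustration rather than the authors' implementation: `project_tokens` is a hypothetical stand-in for a token-to-frame conditioning step, `toy_velocity` replaces the learned velocity network with a closed-form field, the codebook values are made up, and the time convention (noise at t = 0, motion at t = 1) is assumed. Decoding integrates dx/dt = v(x, t, cond) with Euler steps, starting from Gaussian noise.

```python
import random


def project_tokens(tokens, codebook, frames_per_token=4):
    """Hypothetical conditioning step: repeat each token's value once per frame."""
    return [codebook[tok] for tok in tokens for _ in range(frames_per_token)]


def toy_velocity(x, t, cond):
    """Placeholder for a learned network v(x_t, t, cond).

    With x_t = (1 - t) * x_0 + t * x_1 and a single target x_1 = cond,
    the rectified-flow velocity x_1 - x_0 equals (cond - x_t) / (1 - t).
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


def decode(tokens, codebook, steps=8, seed=0):
    """Euler integration of the flow from noise (t = 0) toward motion (t = 1)."""
    rng = random.Random(seed)
    cond = project_tokens(tokens, codebook)
    x = [rng.gauss(0.0, 1.0) for _ in cond]  # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up codebook: token id -> scalar pose value.
codebook = {0: -1.0, 1: 0.5, 2: 2.0}
motion = decode([0, 2, 1], codebook)
```

In this toy setting the target is deterministic, so the samples collapse onto the conditioning values. A learned velocity network would instead model a distribution over plausible motions for the same tokens.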
