ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Wanjiang Weng, Xiaofeng Tan, Junbo Wang, Guo-Sen Xie, Pan Zhou, Hongsong Wang

TL;DR

The paper tackles the misalignment issue between text and diffusion-generated motions in text-to-motion systems. It introduces ReAlign, a plug-and-play reward-guided sampling framework that combines a step-aware reward model with a dual-alignment reward to form an ideal sampling distribution $p_t^{I}(\mathbf{x}|c)=p_t(\mathbf{x}|c)p_t^{r}(\mathbf{x}|c)/Z(c)$, guiding both the continuous reverse SDE and discrete DDPM updates. The approach provides a theoretical basis showing the reward gradient decomposes into components that steer denoising toward text-motion fidelity and motion realism, while the step-aware design handles noise variations across timesteps. Empirically, ReAlign yields significant improvements in text-motion alignment and motion quality across multiple baselines and datasets, and demonstrates strong text-to-motion retrieval enhancements, all without requiring diffusion-model fine-tuning. The work highlights the practicality of integrating reward signals at inference time to improve diffusion-based generation, with potential extensions to broader reward types and tasks.
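
The reward-guided update described above can be sketched for the discrete DDPM case. This is a minimal illustration in the style of classifier guidance, not the paper's algorithm: the function names `guided_ddpm_step` and `toy_reward_grad`, the `guidance_scale` parameter, and the quadratic toy reward are all assumptions introduced here, and a real system would obtain the gradient from a learned step-aware reward model.

```python
import numpy as np

def guided_ddpm_step(x_t, eps_pred, reward_grad, alpha_t, alpha_bar_t, sigma_t,
                     guidance_scale, rng):
    """One DDPM ancestral sampling step with an additive reward-gradient shift."""
    # Vanilla DDPM posterior mean computed from the predicted noise.
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
    # Assumed guidance rule: shift the mean along the gradient of the log-reward,
    # scaled by the step variance (classifier-guidance style).
    mean = mean + guidance_scale * sigma_t ** 2 * reward_grad
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def toy_reward_grad(x_t, target):
    """Gradient of a hypothetical log-reward, log r(x) = -0.5 * ||x - target||^2."""
    return target - x_t

x_t = np.zeros(3)
eps_pred = np.zeros(3)  # stand-in for the denoiser output
grad = toy_reward_grad(x_t, target=np.ones(3))
schedule = dict(alpha_t=0.99, alpha_bar_t=0.5, sigma_t=0.1)

# Same seed for both calls, so the two samples differ only by the guidance shift.
plain = guided_ddpm_step(x_t, eps_pred, grad, guidance_scale=0.0,
                         rng=np.random.default_rng(0), **schedule)
guided = guided_ddpm_step(x_t, eps_pred, grad, guidance_scale=5.0,
                          rng=np.random.default_rng(0), **schedule)
print(guided - plain)
```

With identical noise in both calls, the printed difference is approximately `guidance_scale * sigma_t**2 * grad`, which makes the effect of the reward term easy to isolate. Setting `guidance_scale=0.0` recovers the unguided step, which matches the plug-and-play framing: the diffusion model itself is untouched.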

Abstract

Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motions. However, there is a misalignment between the text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model that assesses alignment quality during denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency with a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.

Paper Structure

This paper contains 12 sections, 2 theorems, 13 equations, 4 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

When the ideal sampling distribution $p_t^{I}(\mathbf{x}|c)=p_t(\mathbf{x}|c)p_t^{r}(\mathbf{x}|c)/Z(c)$ replaces the vanilla sampling distribution $p_t(\mathbf{x}|c)$, the reverse SDE becomes:
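
As a hedged sketch of why the reward enters the reverse SDE additively, assume the standard score-based formulation with drift $f(\mathbf{x},t)$, diffusion coefficient $g(t)$, and reverse-time Wiener process $\bar{\mathbf{w}}$. This notation is an assumption introduced here and may differ from the paper's. Taking the logarithm of the product form $p_t^{I}(\mathbf{x}|c)=p_t(\mathbf{x}|c)p_t^{r}(\mathbf{x}|c)/Z(c)$ and differentiating with respect to $\mathbf{x}$, the normalizer $Z(c)$ drops out because it does not depend on $\mathbf{x}$:

$$\nabla_{\mathbf{x}}\log p_t^{I}(\mathbf{x}|c)=\nabla_{\mathbf{x}}\log p_t(\mathbf{x}|c)+\nabla_{\mathbf{x}}\log p_t^{r}(\mathbf{x}|c)$$

Substituting this score into the generic reverse SDE gives, under the same assumptions,

$$\mathrm{d}\mathbf{x}=\left[f(\mathbf{x},t)-g(t)^{2}\left(\nabla_{\mathbf{x}}\log p_t(\mathbf{x}|c)+\nabla_{\mathbf{x}}\log p_t^{r}(\mathbf{x}|c)\right)\right]\mathrm{d}t+g(t)\,\mathrm{d}\bar{\mathbf{w}}$$

The first score term is what the pretrained diffusion model already supplies, and the second is the extra drift contributed by the reward, so no fine-tuning of the diffusion model is needed to apply it.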

Figures (4)

  • Figure 1: Visual comparison of text-to-motion generation. This figure presents motions generated by existing methods, such as Mo.Diffuse [Zhang2024], MDM [Tevet2023], MLD [Chen2023], and MotionLCM [motionlcm]. Our ReAlign enhances these models to produce motions that align more closely with text inputs.
  • Figure 2: Illustration of the sampling process in diffusion-based motion generation frameworks. The blue region represents the sampling distribution $p_t(\cdot)$ learned by the diffusion model, while the green region depicts the ideal sampling distribution $p_t^I(\cdot)$ achieved by incorporating our proposed reward-guided sampling strategy with the sampling distribution $p_t(\cdot)$.
  • Figure 3: Framework of the step-aware reward model. During this process, time-aware tokens, consisting of timestep embedding $t$ and motion embeddings $x_t^k$, are aligned with text embedding $c$ in the latent space and reconstructed via the decoder, with the encoder and decoder jointly optimized by contrastive loss $\mathcal{L}_C$ and representation loss $\mathcal{L}_R$ [petrovich2022temos].
  • Figure 4: Comparison of motion generation quality across denoising steps for the MLD w/o ReAlign, MLD w/o Step-Aware, and MLD w/ Step-Aware (ReAlign). ReAlign consistently outperforms the others, highlighting the necessity of explicit noise handling during denoising.
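
The contrastive alignment between motion and text embeddings mentioned in the Figure 3 caption can be illustrated with a generic symmetric InfoNCE loss. This is only a sketch of the general technique: the function name `symmetric_info_nce`, the `temperature` value, and the use of cosine similarity are assumptions made here, and the paper's $\mathcal{L}_C$ may be defined differently.

```python
import numpy as np

def symmetric_info_nce(motion_emb, text_emb, temperature=0.1):
    """Generic symmetric InfoNCE over a batch of paired (motion, text) embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature  # shape (batch, batch); diagonal holds true pairs

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, then pick the matching pair.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the motion-to-text and text-to-motion directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy batch: orthogonal embeddings, once correctly paired and once mispaired.
emb = np.eye(4)
loss_matched = symmetric_info_nce(emb, emb)
loss_shuffled = symmetric_info_nce(emb, np.roll(emb, 1, axis=0))
print(loss_matched < loss_shuffled)
```

In the toy batch, correctly paired embeddings give a much smaller loss than mispaired ones, which is the behavior a contrastive objective is meant to reward. In a step-aware setting, the motion embedding would also be conditioned on the timestep, which this sketch omits.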

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2