RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse

Zhouyingcheng Liao, Mingyuan Zhang, Wenjia Wang, Lei Yang, Taku Komura

TL;DR

The paper tackles the generalization gap in text-to-human motion generation by introducing RMD, a training-free retrieval-augmented baseline that decomposes prompts with an LLM, retrieves multi-granularity motions from an external database, composes them coherently, and refines the result with a pretrained motion diffusion model. Using a hierarchical retrieval strategy and an SDEdit-style diffusion refinement, RMD balances retrieved guidance against the diffusion prior and outperforms prior methods on both in-domain and out-of-domain data without extra training, as demonstrated on the HumanML3D and Mixamo benchmarks. Key results show improvements in R-Precision and MM Dist, alongside favorable user-study feedback for OOD prompts, indicating better semantic alignment and motion naturalness. The practical takeaway is that leveraging external data and a strong diffusion prior at inference time yields more generalizable motion generation with minimal design complexity and no training overhead. The paper leaves automatic selection of $t_0$, the noise level at which refinement starts, to future work.

Abstract

While motion generation has made substantial progress, its practical application remains constrained by dataset diversity and scale, limiting its ability to handle out-of-distribution scenarios. To address this, we propose a simple and effective baseline, RMD, which enhances the generalization of motion generation through retrieval-augmented techniques. Unlike previous retrieval-based methods, RMD requires no additional training and offers three key advantages: (1) the external retrieval database can be flexibly replaced; (2) body parts from the motion database can be reused, with an LLM facilitating splitting and recombination; and (3) a pre-trained motion diffusion model serves as a prior to improve the quality of motions obtained through retrieval and direct combination. Without any training, RMD achieves state-of-the-art performance, with notable advantages on out-of-distribution data.
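
The decompose-retrieve-recompose idea can be illustrated with a small sketch. This is not the authors' implementation: the token-overlap score `jaccard`, the two-level layout, the threshold value, and every database entry below are assumptions made up for illustration. The only behavior taken from the paper's description is that retrieval moves from full-body to finer-grained motions and stops once the retrieval score meets a threshold.

```python
from typing import Dict, List, Tuple

# One retrieval level maps a part name to (query text, {motion_id: caption}).
Level = Dict[str, Tuple[str, Dict[str, str]]]


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token-set overlap (a stand-in for a real retrieval score)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def best_match(query: str, entries: Dict[str, str]) -> Tuple[float, str]:
    """Return (score, motion_id) of the best-scoring caption in one database."""
    return max((jaccard(query, caption), motion_id)
               for motion_id, caption in entries.items())


def hierarchical_retrieve(levels: List[Level], threshold: float) -> Tuple[int, Dict[str, str]]:
    """Walk levels from coarse to fine; stop at the first level where every
    part's best score meets the threshold. Falls back to the finest level."""
    if not levels:
        raise ValueError("at least one retrieval level is required")
    picked: Dict[str, str] = {}
    for depth, parts in enumerate(levels):
        picked, all_ok = {}, True
        for part, (query, entries) in parts.items():
            score, motion_id = best_match(query, entries)
            picked[part] = motion_id
            all_ok = all_ok and score >= threshold
        if all_ok:
            return depth, picked
    return len(levels) - 1, picked


# Hypothetical databases and a hand-written prompt decomposition
# (in RMD the decomposition is produced by an LLM agent).
full_db = {"m1": "a person walks forward", "m2": "a person waves the right hand"}
upper_db = {"u1": "waves the right hand", "u2": "claps hands"}
lower_db = {"l1": "walks forward", "l2": "jumps in place"}

levels = [
    {"full_body": ("a person walks forward while waving the right hand", full_db)},
    {"upper_body": ("waves the right hand", upper_db),
     "lower_body": ("walks forward", lower_db)},
]
depth, picked = hierarchical_retrieve(levels, threshold=0.8)
print(depth, picked)
```

In this toy setup no full-body caption scores above the threshold, so the search falls through to the body-part level and picks one upper-body and one lower-body clip. Recomposing those clips into a single guided motion is a separate step that the sketch does not cover.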

Paper Structure

This paper contains 31 sections, 3 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Existing methods struggle with out-of-distribution motion generation due to two main challenges: (1) The compositional complexity of human motion makes it difficult for training sets to cover all possible full-body motions; (2) Diverse motion descriptions create a persistent gap between testing and training prompts. We propose a simple baseline, RMD, a retrieval-augmented, training-free method with two stages: motion retrieval, which uses a decompose-retrieve-recompose hierarchical strategy with 3 levels to bridge the aforementioned gap, and motion diffusion, which refines the composed motion with a pre-trained diffusion model to enhance body coordination and enrich generation diversity.
  • Figure 2: Method overview of RMD. Given a query text prompt, RMD uses a Decomposition Agent to split the prompt into body-part descriptions and a Retrieval Agent to search for corresponding motions. In the first stage, a hierarchical retrieval strategy is employed, proceeding from full-body to fine-grained motions. The process stops once the retrieval score meets the threshold, and the retrieved body parts are recomposed into a full motion, which serves as the guided motion. In the second stage, RMD leverages a pre-trained motion diffusion model to refine the guided motion with the original query prompt, yielding the final motion.
  • Figure 3: Qualitative comparison between our method and previous methods. Our method achieves the best text alignment.
  • Figure 4: Generated motions with various $t_0$. $t_0=0$ is the guided motion. $t_0=1$ means starting from pure noise and is equivalent to MotionDiffuse. Since we use the same random seed for all samples here, a $t_0$ in between can be seen as an interpolation between the guided motion and the pure diffusion generation. In the first two rows, the guided motion has obvious artifacts, while MotionDiffuse fails to understand the prompt. Yet a $t_0$ in the middle incorporates the semantic information from the guided motion while remaining free from artifacts. In the third row, the guided motion retrieves a spinning motion without dancing, while MotionDiffuse generates a dance without spinning. A proper $t_0$ can combine both pieces of information to produce better results.
  • Figure 5: User study on OOD data. Our method outperforms others by a significant margin.
  • ...and 1 more figure
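
The role of $t_0$ described for Figure 4 can be sketched as an SDEdit-style procedure: noise the guided motion up to a fraction $t_0$ of the diffusion schedule, then run the reverse process from that point. The code below is a minimal illustration, not the paper's implementation. The linear beta schedule, the flat list of floats standing in for a motion, and the `denoise_step` callable (a placeholder for one reverse step of a pretrained, text-conditioned model) are all assumptions.

```python
import math
import random
from typing import Callable, List

Motion = List[float]  # hypothetical flat stand-in for a motion sequence


def alpha_bars(num_steps: int, beta_start: float = 1e-4,
               beta_end: float = 2e-2) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear schedule (an assumption)."""
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def refine(guided: Motion, t0: float, num_steps: int,
           denoise_step: Callable[[Motion, int], Motion],
           rng: random.Random) -> Motion:
    """Noise `guided` to step round(t0 * num_steps), then denoise back to step 0.

    t0 = 0 returns the guided motion unchanged; t0 = 1 starts from an almost
    fully noised sample, so the result is dominated by the diffusion prior.
    """
    if not 0.0 <= t0 <= 1.0:
        raise ValueError("t0 must lie in [0, 1]")
    start = round(t0 * num_steps)
    if start == 0:
        return list(guided)
    abar = alpha_bars(num_steps)[start - 1]
    # Forward diffusion in closed form: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
    x = [math.sqrt(abar) * v + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0)
         for v in guided]
    for t in range(start, 0, -1):
        x = denoise_step(x, t)  # one reverse step of the (hypothetical) model
    return x


# Usage with a placeholder denoiser that only records which steps were visited.
visited: List[int] = []


def recording_step(x: Motion, t: int) -> Motion:
    visited.append(t)
    return x


guided = [0.0, 0.5, 1.0]
unchanged = refine(guided, 0.0, 10, recording_step, random.Random(0))
halfway = refine(guided, 0.5, 10, recording_step, random.Random(0))
print(unchanged, visited)
```

With `t0 = 0.5` and 10 steps, only the last five reverse steps run, so the output stays closer to the guided motion than a full generation would. Fixing the random seed across different values of $t_0$, as the caption notes, makes the results comparable as an interpolation between the two extremes.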