ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model

Mingyuan Zhang; Xinying Guo; Liang Pan; Zhongang Cai; Fangzhou Hong; Huirong Li; Lei Yang; Ziwei Liu

TL;DR

ReMoDiffuse tackles the challenge of generating diverse and high-quality 3D human motions from text prompts by integrating a retrieval mechanism into a diffusion-based motion model. It introduces Hybrid Retrieval and a Semantics-Modulated Transformer to selectively fuse semantic and kinematic information from retrieved samples, with a learnable Condition Mixture to balance multiple conditioning signals during inference. Comprehensive experiments on KIT-ML and HumanML3D demonstrate superior performance, especially for uncommon or diverse motions, supported by new diversity-oriented metrics. The approach offers improved generalization and fidelity with efficient inference, though it acknowledges potential misuse for synthetic media generation.

Abstract

3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarities. 2) Semantics-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between retrieved samples and the target motion sequence. 3) Condition Mixture better utilizes the retrieval database during inference, overcoming the scale sensitivity in classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing both text-motion consistency and motion quality, especially for more diverse motion generation.
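The Hybrid Retrieval idea, ranking database entries by both semantic and kinematic similarity, can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything specific here is an assumption: the score is taken to be the cosine similarity of text features multiplied by an exponential penalty on the relative difference in motion length, and the names `hybrid_score`, `retrieve`, `lam`, and the dictionary keys are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hybrid_score(query_feat, query_len, entry_feat, entry_len, lam=0.1):
    """Assumed scoring rule: semantic similarity damped by a length mismatch.

    rel_diff is the relative difference of motion lengths, in [0, 1).
    lam (hypothetical) controls how strongly a length mismatch is penalised.
    """
    rel_diff = abs(entry_len - query_len) / max(entry_len, query_len)
    return cosine(query_feat, entry_feat) * math.exp(-lam * rel_diff)


def retrieve(query_feat, query_len, database, k=2, lam=0.1):
    """Return the k database entries with the highest hybrid score."""
    ranked = sorted(
        database,
        key=lambda e: hybrid_score(
            query_feat, query_len, e["text_feat"], e["length"], lam
        ),
        reverse=True,
    )
    return ranked[:k]
```

With two entries that share the same text feature, the one whose motion length is closer to the requested length ranks first; an entry with an unrelated text feature ranks last regardless of its length.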

Paper Structure (31 sections, 7 equations, 6 figures, 6 tables)

Introduction
Related Work
Diffusion Models
Text-Driven Motion Generation
Framework Overview
Diffusion Model for Motion Generation
Retrieval-Augmented Motion Generation
Hybrid Retrieval.
Network Architecture.
Semantics-Modulated Attention.
Stylization Block.
Condition Mixture
Contrastive Model.
Parameter Finetuning.
Training and Inference
...and 16 more sections

Figures (6)

Figure 1: ReMoDiffuse is a retrieval-augmented 3D human motion diffusion model. Benefiting from the extra knowledge in the retrieved samples, ReMoDiffuse achieves high fidelity to the given prompts.
Figure 2: Overview of the proposed ReMoDiffuse. a) The hybrid retrieval database stores various features of each training sample. The pre-processed text features and the relative difference in motion length are used to compute similarity with the given language description. The most similar samples are fed into the semantics-modulated transformer (SMT), serving as additional clues for motion generation. b) The semantics-modulated transformer stacks $N$ identical decoder layers, each comprising a semantics-modulated attention (SMA) layer and an FFN layer. The figure shows the detailed architecture of the SMA module. Text features $f_{\mathrm{prompt}}$ extracted by CLIP from the given prompt, features $R^t$ and $R^m$ from the retrieved samples, and the current motion features $f_{\Theta}$ are used to further refine the noised motion sequence. c) To synthesize diverse and realistic motion sequences, the motion transformer starts from a pure-noise sample and repeatedly removes the noise. To better mix outputs under different combinations of conditions, we propose a training strategy to find the optimal hyper-parameters $w_1$, $w_2$, $w_3$, and $w_4$.
Figure 3: Architecture of the stylization block. This module is adapted from MotionDiffuse (Zhang et al., 2022). We remove the prompt embedding from the original design to better support classifier-free guidance. This module injects the information of the current timestamp into the feature representation, which is necessary for the denoising steps. Specifically, the timestamp embedding $e_t$ is fed into a series of transformation layers. Two embeddings are generated afterward and serve as an additive offset and a multiplicative offset to the original feature map, respectively.
Figure 4: Visual comparison between previous works and ReMoDiffuse. We draw black lines to show the translation path. For both given conditions, only ReMoDiffuse conveys the action and the path accurately.
Figure 5: Rareness distribution of the HumanML3D test split. We split all test cases into 100 bins according to their Rareness values.
...and 1 more figures
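The Condition Mixture described in the Figure 2 caption combines denoiser outputs obtained under different combinations of conditions using weights $w_1$ to $w_4$. The sketch below shows one way such a mixture could look. The assignment of each weight to a condition combination, the function names `mix` and `cfg_weights`, and the `CONDITIONS` labels are assumptions for illustration; the captions do not specify them, and here the weights are passed in by hand rather than learned.

```python
# Assumed ordering of the four condition combinations (w_1 .. w_4).
CONDITIONS = ("text+retrieval", "text", "retrieval", "none")


def mix(outputs, weights):
    """Weighted sum of denoiser outputs, one output per condition combination.

    outputs: dict mapping each name in CONDITIONS to a list of floats.
    weights: four floats, matched to CONDITIONS in order.
    """
    length = len(outputs[CONDITIONS[0]])
    mixed = [0.0] * length
    for name, w in zip(CONDITIONS, weights):
        for i, value in enumerate(outputs[name]):
            mixed[i] += w * value
    return mixed


def cfg_weights(scale):
    """Standard classifier-free guidance written as a special case of mix().

    uncond + scale * (cond - uncond) == scale * cond + (1 - scale) * uncond,
    so only the fully conditioned and unconditioned outputs get a weight.
    """
    return (scale, 0.0, 0.0, 1.0 - scale)
```

Writing classifier-free guidance as a weight vector makes the contrast visible: guidance exposes a single hand-tuned scale, whereas a four-weight mixture can also draw on the text-only and retrieval-only outputs.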
