MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion

Sai Shashank Kalakonda; Shubh Maheshwari; Ravi Kiran Sarvadevabhatla

TL;DR

MoRAG tackles the challenge of broadening language generalization in text-to-human motion by introducing a multi-part retrieval strategy that uses LLM-generated part-specific descriptions to retrieve torso, hand, and leg motions. These part motions are spatially composed into full-body sequences and used as additional conditioning for diffusion-based motion generation via a Semantics-Modulated Transformer backbone, yielding improved semantic alignment, diversity, and zero-shot capability. The approach is instantiated as MoRAG (retrieval) and MoRAG-Diffuse (generation), demonstrated on HumanML3D with GPT-3.5-turbo-instruct prompts, and shown to outperform prior text-to-motion retrieval and diffusion baselines. It can be plugged into existing diffusion models to make motion generation more robust, more varied, and better on unseen text, with practical relevance for human motion synthesis in animation and robotics. The authors acknowledge that the method depends on GPT prompts and on dataset scale.

Abstract

We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos are available at: https://motion-rag.github.io/
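The abstract's two core steps, part-wise retrieval and spatial composition, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the 6-joint skeleton, the `PART_JOINTS` partition, the `retrieve` and `compose` helpers, and the rule of truncating to the shortest retrieved motion are all hypothetical and chosen only to keep the example small.

```python
import math

# Hypothetical joint-index partition of a toy 6-joint skeleton (an assumption,
# not the partition used in the paper).
PART_JOINTS = {"torso": [0, 1], "hands": [2, 3], "legs": [4, 5]}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb, database):
    """Return the motion whose embedding is most similar to the query.

    `database` is a list of (embedding, motion) pairs; a motion is a list of
    frames and a frame is a list of per-joint values.
    """
    return max(database, key=lambda item: cosine(query_emb, item[0]))[1]


def compose(part_motions):
    """Spatially compose part motions into one full-body motion.

    Each output frame copies the torso joints from the torso motion, the hand
    joints from the hands motion and the leg joints from the legs motion.
    Motions are truncated to the shortest one (a simplifying assumption).
    """
    length = min(len(m) for m in part_motions.values())
    n_joints = sum(len(ids) for ids in PART_JOINTS.values())
    full = []
    for t in range(length):
        frame = [None] * n_joints
        for part, ids in PART_JOINTS.items():
            for j in ids:
                frame[j] = part_motions[part][t][j]
        full.append(frame)
    return full
```

In this sketch each part has its own database and its own query embedding, so a description never seen as a whole can still be assembled from parts that were each seen separately.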

Paper Structure (28 sections, 6 equations, 9 figures, 3 tables)

This paper contains 28 sections, 6 equations, 9 figures, 3 tables.

Introduction
Related Works
Text-conditioned human motion generation
Motion Diffusion Models
Text-to-Motion retrieval
Proposed Method
Augmented Motion Retrieval Strategy
Generation of part-specific descriptions
Multi-part motion retrieval
Spatial motion composition
MoRAG-Diffuse
Experiments & Results
Dataset and Implementation Details
Results
Discussion
...and 13 more sections

Figures (9)

Figure 1: MoRAG is a retrieval-augmented framework for generating human motion from text. It integrates part-specific motion retrieval models with large language models to improve the quality of generation and retrieval tasks across various text descriptions. The black arrow illustrates motion translation. In the bottom figures, red, blue, and green represent the retrieved motion for the hands, torso, and legs. The varying transparency in the figure indicates the progression of time steps.
Figure 2: MoRAG utilizes part-specific descriptions to effectively retrieve relevant samples, demonstrating robustness to variations in motion length and descriptive text. In contrast, ReMoDiffuse [zhang2023remodiffuse], a hybrid approach based on motion length and text similarity, fails to retrieve suitable samples when the motion length or the text changes. Each ReMoDiffuse panel displays the retrieved text at the top and the corresponding motion length in brackets. For MoRAG, the three part-specific retrieved texts are shown along with their HumanML3D [Guo_2022_CVPR] IDs, marked with "#". A tick or a cross indicates whether the motion corresponds to the input text.
Figure 3: MoRAG Overview: Given a text description $\texttt{text}_i$, we generate part-specific descriptions corresponding to "Torso," "Hands," and "Legs" by prompting an LLM. These generated descriptions are used as queries to retrieve corresponding part-specific motions: $R^i_{torso}$, $R^i_{hands}$, and $R^i_{legs}$ from the motion databases $D_{torso}$, $D_{hands}$, and $D_{legs}$, respectively. The retrieved motions are then fused to construct a full-body motion sequence $C^i$ that aligns with the input text. The constructed motion samples are used as additional information in the motion generation pipeline during both training and inference, alongside the input text, to further improve model performance.
Figure 4: MoRAG Training: Our objective is to construct three independent part-specific motion databases. The training paradigm includes three motion retrieval models: $MoRAG_{torso}$, $MoRAG_{hands}$, and $MoRAG_{legs}$, each corresponding to a specific body part. We train these three models independently using part-specific body movement descriptions generated by LLMs for text phrases $\texttt{text}_i$ and their corresponding full-body motion sequences $\texttt{motion}_i$. We adopt a contrastive training objective between part-specific text embeddings ($Z^T_{p, i}$) generated by text encoders ($T^{Enc}_p$) and motion embeddings ($Z^M_{p, i}$) generated by the corresponding part-specific motion encoder ($M^{Enc}_p$). The diagonal elements, representing positive pairs (green), are maximized, while the off-diagonal elements, representing negative pairs with text similarity below a threshold (red), are minimized. For simplicity, we do not visualize the motion decoder, but we follow a training procedure similar to the one described in [petrovich23tmr].
Figure 5: LLM Importance: Incorporating part-wise descriptions generated by LLMs into text-to-motion retrieval improves generalization over the language space. (a) Spelling Error - MoRAG successfully retrieves and constructs the correct motion sequence when 'sit-ups' is replaced with 'situps', unlike TMR [petrovich23tmr]. (b) Rephrasing - MoRAG effectively retrieves the correct motion sequence even when the voice is changed from active to passive. (c) Substitution - MoRAG accurately retrieves the correct motion sequence when 'chest' is replaced with its synonym 'heart'.
...and 4 more figures
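
The contrastive objective described in the Figure 4 caption (positives on the diagonal, negatives restricted to pairs whose text similarity is below a threshold) might be sketched as follows. This is a hedged illustration that assumes a symmetric InfoNCE form; the function name `filtered_info_nce` and the default `threshold` and `temperature` values are hypothetical and not taken from the paper.

```python
import math


def filtered_info_nce(sim, text_sim, threshold=0.8, temperature=0.1):
    """Symmetric InfoNCE over a text-motion similarity matrix.

    sim[i][j]      : similarity of text i and motion j (positives on the diagonal).
    text_sim[i][j] : similarity between texts i and j (assumed symmetric); an
                     off-diagonal pair is used as a negative only if this value
                     is below `threshold`.
    """
    n = len(sim)

    def one_direction(get):
        total = 0.0
        for i in range(n):
            logits = [get(i, i) / temperature]  # positive pair first
            logits += [
                get(i, j) / temperature
                for j in range(n)
                if j != i and text_sim[i][j] < threshold
            ]
            m = max(logits)  # subtract the max for numerical stability
            log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_denominator - logits[0]
        return total / n

    text_to_motion = one_direction(lambda i, j: sim[i][j])
    motion_to_text = one_direction(lambda i, j: sim[j][i])
    return 0.5 * (text_to_motion + motion_to_text)
```

Dropping negatives whose texts are nearly identical avoids penalizing the model for placing two motions with effectively the same description close together in the embedding space.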
