VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, Xiangtai Li, Ruijie Guo, Meishan Zhang, Min Zhang, Hao Fei

TL;DR

VimoRAG tackles data scarcity and the resulting out-of-domain (OOD) and out-of-vocabulary (OOV) failures in motion LLMs by introducing a video-based retrieval-augmented generation framework that draws on large in-the-wild video databases. The approach couples Gemini-MVR, a dual-channel, motion-centric video retriever, with McDPO, a dual-alignment training strategy that teaches the LLM to use retrieved video priors appropriately and to self-correct when retrieval is poor. Empirical results show substantial improvements in both out-of-domain (IDEA400) and in-domain (HumanML3D) settings, and performance improves as the retrieval corpus grows, which indicates that the method scales. The work makes motion generation more practical by reducing reliance on limited annotated 3D motion data and enabling robust, video-informed generation. Its framework lays groundwork for broader multimodal RAG integrations and more efficient backbone selection in future motion-language systems.

Abstract

This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input. All the resources are available at https://walkermitty.github.io/VimoRAG/

Paper Structure

This paper contains 44 sections, 6 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: ReMoDiffuse is a RAG-based motion generation method that is limited by the small scale of motion data and by its reliance on annotated captions. We propose VimoRAG, which advances on it by ① enabling retrieval from large-scale, in-the-wild video databases without text captions, ② identifying and overcoming key challenges in human-centric text-to-video retrieval, and ③ ensuring alignment between retrieved videos and generated motions while mitigating error propagation.
  • Figure 2: Overview of the VimoRAG pipeline: (1) text-to-video retrieval via Gemini-MVR, and (2) video-augmented motion generation guided by text and retrieved video. Gemini-MVR is designed to improve cross-modal human-centric video retrieval, while the McDPO training strategy mitigates error propagation caused by noisy retrievals.
  • Figure 3: The architecture of the Gemini-MVR model. $\theta_{\mathcal{P}}$ and $\theta_{\mathcal{G}}$ represent the predicate semantic extractor and the argument semantic extractor, respectively. $\theta_{\mathcal{A}}$ and $\theta_{\mathcal{O}}$ denote the action encoder and object encoder, respectively. We simply introduce a lightweight action-level retriever and a routing module $\mathcal{I}$, while keeping the architecture of VFMs unchanged. This twin-module design provides strong extensibility.
  • Figure 4: The McDPO training strategy. Given a text $t$ and a retrieved video $v$, we first perform visual demonstration-enhanced instruction tuning to establish a base reference model $\pi_{ref}$. Then, based on the motion-centric dual-alignment reward model, we construct a preference dataset and apply DPO training. The reward model jointly measures motion similarity in the feature space and semantic consistency with the text, guiding the model to learn informative motion priors and maximize preference rewards through self-improvement.
  • Figure 5: Zero-shot qualitative results on the IDEA400 test set. All motions are generated directly by models trained on the HumanML3D training set. Due to space constraints, the text shown here includes only the motion-related words. The full text and more results are available in the appendix figures.
  • ...and 8 more figures
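
The McDPO caption (Figure 4) describes scoring generated motions with a dual-alignment reward, building a preference dataset from those scores, and then applying DPO against the reference model $\pi_{ref}$. The Python sketch below illustrates that flow under stated assumptions and is not the authors' implementation. The names `dual_alignment_reward`, `build_preference_pair`, `reference_feat`, `alpha`, and `beta` are hypothetical; the caption does not say what the motion similarity is measured against or how the two reward terms are combined, so the sketch assumes cosine similarity to a reference motion feature and a weighted sum. Only the `dpo_loss` function follows the standard, published DPO objective.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dual_alignment_reward(
    motion_feat: Sequence[float],
    reference_feat: Sequence[float],
    text_feat: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical reward: weighted sum of motion similarity and text consistency.

    Assumption: both terms are cosine similarities in a shared feature space,
    and `alpha` balances them. The source does not specify either choice.
    """
    motion_term = cosine(motion_feat, reference_feat)
    text_term = cosine(motion_feat, text_feat)
    return alpha * motion_term + (1.0 - alpha) * text_term


def build_preference_pair(
    candidates: List[Sequence[float]],
    reference_feat: Sequence[float],
    text_feat: Sequence[float],
) -> Tuple[int, int]:
    """Return (chosen_index, rejected_index): highest and lowest reward candidates."""
    rewards = [
        dual_alignment_reward(c, reference_feat, text_feat) for c in candidates
    ]
    indices = range(len(candidates))
    chosen = max(indices, key=rewards.__getitem__)
    rejected = min(indices, key=rewards.__getitem__)
    return chosen, rejected


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)])
    computed as a numerically stable softplus of the negated margin.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy assigns the same log-probabilities as the reference model, the margin is zero and the loss equals $\log 2$; the loss falls below that value as the policy shifts probability toward the chosen motion relative to the rejected one.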