ReMoMask: Retrieval-Augmented Masked Motion Generation

Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang

TL;DR

ReMoMask tackles the challenges of text-to-motion generation by unifying retrieval-augmented generation with masked, 2D-quantized motion representations. It introduces Bidirectional Momentum Text-Motion Modeling to expand the pool of negative samples for robust cross-modal retrieval, Semantic Spatial-Temporal Attention to fuse text, retrieved knowledge, and spatial-temporal motion structure, and RAG-Classifier-Free Guidance to improve generalization. The framework quantizes motion with a 2D RVQ-VAE, uses a 2D retrieval-augmented masked transformer for base token generation, and refines details with a 2D residual transformer, achieving state-of-the-art FID and retrieval metrics on HumanML3D and KIT-ML. Empirical results, including ablations and a user study, demonstrate improved realism and text-motion alignment, suggesting strong practical potential for controllable, diverse human motion synthesis in multimedia applications.
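As a rough illustration of the residual quantization idea behind an RVQ-VAE, the sketch below quantizes one feature vector per (frame, body part) cell, with each layer encoding the residual left by the previous one. It is a toy sketch under stated assumptions, not the paper's implementation: the function names, the hand-written codebooks, and the `latents[t][j]` layout are hypothetical, and the learned encoder and decoder are omitted.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries of every layer to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def encode_token_map(latents, codebooks):
    """Assumed layout: latents[t][j] is the feature of body part j at frame t.

    Returns a T x J map whose cells hold one index per quantization layer.
    """
    return [[rvq_encode(vec, codebooks) for vec in frame] for frame in latents]
```

In this reading, the first index in each cell plays the role of the base token, and the remaining indices are the residual details that a second-stage model would refine.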

Abstract

Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) a Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) a Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
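To make the phrase "decouples negative sample scale from batch size via momentum queues" concrete, here is a minimal sketch assuming a MoCo-style setup: keys come from slowly updated momentum encoders, and a fixed-length queue per modality supplies the negatives. The abstract does not spell out the loss, so the InfoNCE form, the temperature value, and every function name below are assumptions for illustration only; in-batch negatives are left out for brevity.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(online, momentum, m=0.99):
    """EMA update of momentum-encoder parameters (flat lists of floats here)."""
    return [m * pm + (1.0 - m) * po for po, pm in zip(online, momentum)]


def info_nce(query, positive_key, negatives, temperature=0.07):
    """Cross-entropy of the positive against itself plus all queued negatives."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negatives]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]


def bidirectional_step(text_q, motion_q, text_k, motion_k, text_queue, motion_queue):
    """One training step: text-to-motion and motion-to-text losses, then enqueue.

    *_q are online-encoder embeddings, *_k are momentum-encoder embeddings.
    The queues are deques with maxlen=K, so the oldest keys drop out on their own.
    """
    n = len(text_q)
    t2m = sum(info_nce(text_q[i], motion_k[i], motion_queue) for i in range(n)) / n
    m2t = sum(info_nce(motion_q[i], text_k[i], text_queue) for i in range(n)) / n
    motion_queue.extend(motion_k)
    text_queue.extend(text_k)
    return 0.5 * (t2m + m2t)
```

With `text_queue = deque(maxlen=K)` and `motion_queue = deque(maxlen=K)`, the number of negatives per query is bounded by `K` rather than by the batch size, which is the decoupling the abstract describes.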

Paper Structure

This paper contains 30 sections, 15 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison between T2M models. (a) Conventional T2M models. (b) Existing RAG-T2M models. (c) The framework of our proposed ReMoMask.
  • Figure 2: Overview of ReMoMask. (a) Bidirectional Momentum Contrastive Retrieval (BMM) uses two momentum queues, enabling a large pool of negative samples for contrastive learning. (b) ReMoMask quantizes a motion sequence into a 2D token map, capturing not only temporal dynamics but also spatial structure. After that, a Part-Level BMM Retriever retrieves relevant text and motion features based on the prompt embedding. All these conditions are fused via an SSTA module in a 2D RAG-Mask-Transformer together with the latent motion representation. (c) Semantic Spatial-temporal Attention (SSTA) first flattens the masked 2D token map into a 1D structure, then redefines the Q, K, and V matrices using the conditions above, providing effective semantic alignment between the conditions and the spatial-temporal information of motion.
  • Figure 3: Motion Quality User Study
  • Figure 4: Text-Motion Correspondence User Study
  • Figure 5: We randomly sample and visualize 16 motions generated by the proposed ReMoMask framework. These examples are conditioned on diverse prompts randomly selected from HumanML3D, providing qualitative evidence of the model’s ability to synthesize a wide range of realistic and semantically coherent motions.
  • ...and 2 more figures
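
The Figure 2(c) caption says SSTA flattens the masked 2D token map into a 1D sequence and then redefines Q, K, and V using the conditions. The caption does not say how the three matrices are built, so the sketch below makes one explicit assumption purely for illustration: queries are the flattened motion tokens, while keys and values are those tokens concatenated with the condition features. Learned projections, masking, and multi-head structure are omitted, and all names are hypothetical.

```python
import math


def flatten_token_map(token_map):
    """Row-major flatten of a T x J map into a sequence of length T * J."""
    return [tok for frame in token_map for tok in frame]


def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
        )
    return out


def fuse(token_map, conditions):
    """ASSUMPTION: Q = motion tokens; K = V = motion tokens + condition features."""
    seq = flatten_token_map(token_map)
    context = seq + conditions
    return attention(seq, context, context)
```

Here `token_map[t][j]` is assumed to hold the embedding of body part `j` at frame `t`, and `conditions` stands in for the text and retrieved text/motion features; the output keeps one fused vector per cell of the flattened map.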