InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation

Zeyu Zhang, Akide Liu, Qi Chen, Feng Chen, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang

TL;DR

InfiniMotion presents a memory-augmented autoregressive framework for long-text-to-motion generation. The Motion Memory Transformer, enhanced by Bidirectional Mamba Memory, enables coherent generation across extremely long sequences by preserving global semantics and local transitions. Through Residual VQ-VAE, Mask Transformer, and Residual Transformer components, the method discretizes motion, aligns text with tokens, and models multi-layer representations, respectively. Evaluations on the BABEL dataset show over a 30% improvement in FID and the ability to produce motions six times longer than prior work, highlighting strong potential for film, games, and robotics applications.

Abstract

Text-to-motion generation holds potential for film, gaming, and robotics, yet current methods often prioritize short motion generation, making it challenging to produce long motion sequences effectively: (1) current methods struggle to handle long motion sequences as a single input due to prohibitively high computational cost; (2) breaking down the generation of long motion sequences into shorter segments can result in inconsistent transitions and requires interpolation or inpainting, which does not model the sequence as a whole. To solve these challenges, we propose InfiniMotion, a method that generates continuous motion sequences of arbitrary length within an autoregressive framework. We highlight its capability by generating a continuous 1-hour human motion with around 80,000 frames. Specifically, we introduce the Motion Memory Transformer with Bidirectional Mamba Memory, enhancing the transformer's memory to process long motion sequences effectively without overwhelming computational resources. Notably, our method achieves over 30% improvement in FID and demonstrates motions 6 times longer than previous state-of-the-art methods, showcasing significant advancements in long motion generation. See project webpage: https://steve-zeyu-zhang.github.io/InfiniMotion/

Paper Structure

This paper contains 22 sections, 11 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: The diagram illustrates a variety of representative examples of long motion sequences created by our innovative method, InfiniMotion. Each instance within the diagram is based on at least three consecutive user queries, each describing a distinct action. These examples highlight our method's ability to produce high-quality motion sequences, characterized by smooth transitions between different actions. Click on the diagram for a 1-hour demo video.
  • Figure 2: This diagram illustrates the main architecture of our proposed method. The method processes a stream of motion segments in an autoregressive manner within a recurrent memory architecture. The Motion Memory Transformer (MMT) enhances each motion segment with a specialized memory token [mem], which facilitates both long-term semantic coherence and smooth transitions between adjacent motion segments based on user text queries. Within the MMT, we leverage the robust long-term memory capabilities of Mamba (Gu and Dao, 2023), and we have customized a Bidirectional Mamba Memory (BMM) block to further enhance the memory within the transformer. This customization ensures long-term coherence that corresponds to the overall semantics of the entire motion sequence.
  • Figure 3: The diagram presents additional examples of long motion sequences generated by our proposed method. These examples highlight the method's ability to produce smooth transitions between motion segments, resulting in high-quality and diverse motion outputs.
  • Figure 4: This figure displays the User Interface (UI) used in our User Study, showcasing four videos (Video A to D), each with distinct motion animations from the same model. Participants evaluate these animations on aspects such as motion accuracy and overall user experience. They rate each aspect from 1 (low) to 5 (high) to assess how well the animations mirror real-world movements and how engaging they are. This evaluation aims to determine the realism and engagement effectiveness of each motion.
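
The Figure 2 caption describes segments processed autoregressively, with a [mem] token carried from one segment to the next and refined by a bidirectional memory block. The following is a minimal sketch of that control flow only, not the authors' implementation. It assumes a toy exponential recurrence run in both directions as a stand-in for the Bidirectional Mamba Memory block, and every name in it (`bidirectional_scan`, `process_segment`, `generate_long_sequence`, `decay`) is hypothetical.

```python
import numpy as np


def bidirectional_scan(x, decay=0.9):
    """Toy stand-in for a bidirectional state-space block (an assumption,
    not Mamba itself): run a simple exponential recurrence over the token
    axis forward and backward, then average the two passes."""
    fwd = np.zeros_like(x)
    bwd = np.zeros_like(x)
    state = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        state = decay * state + (1.0 - decay) * x[t]
        fwd[t] = state
    state = np.zeros(x.shape[1])
    for t in reversed(range(x.shape[0])):
        state = decay * state + (1.0 - decay) * x[t]
        bwd[t] = state
    return 0.5 * (fwd + bwd)


def process_segment(segment, mem):
    """Prepend the memory token to one segment, mix all tokens, and return
    the mixed segment tokens together with the updated memory token."""
    tokens = np.vstack([mem[None, :], segment])  # shape: (1 + seg_len, dim)
    mixed = bidirectional_scan(tokens)
    # The forward pass lets the memory influence every segment token; the
    # backward pass lets the memory slot absorb information from the segment.
    return mixed[1:], mixed[0]


def generate_long_sequence(segments, dim):
    """Process segments one at a time, carrying only the memory token, so
    the per-step cost depends on segment length, not total length."""
    mem = np.zeros(dim)
    outputs = []
    for seg in segments:
        out, mem = process_segment(seg, mem)
        outputs.append(out)
    return np.vstack(outputs), mem


# Hypothetical usage: three segments of four 8-dimensional tokens each.
rng = np.random.default_rng(0)
demo_segments = [rng.normal(size=(4, 8)) for _ in range(3)]
demo_out, demo_mem = generate_long_sequence(demo_segments, dim=8)
```

Because only the fixed-size memory token crosses segment boundaries, the loop can in principle run for any number of segments, which is the property the caption attributes to the recurrent memory design.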