KMM: Key Frame Mask Mamba for Extended Motion Generation

Zeyu Zhang, Hang Gao, Akide Liu, Qi Chen, Feng Chen, Yiran Wang, Danning Li, Rui Zhao, Zhenming Li, Zhongwen Zhou, Hao Tang, Bohan Zhuang

TL;DR

This work tackles extended motion generation by addressing Mamba's memory limitations and weak text–motion fusion. It introduces Key Frame Masking Modeling (KMM), which selects and masks salient key frames based on local density and distance to higher-density frames to better leverage Mamba's implicit memory. It also introduces a contrastive text–motion alignment objective that learns dynamic text representations, improving alignment beyond frozen CLIP-based approaches. Through extensive experiments on BABEL, BABEL-D, and HumanML3D, KMM achieves state-of-the-art results with substantial FID improvements and reduced compute, enabling more accurate, robust, and diverse long-motion generation for real-world applications.
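As a rough illustration of the key frame selection idea described above, the sketch below scores each frame by its local density and by its minimum distance to any frame of higher density, then marks the top-scoring frames for masking. The Gaussian density kernel, the product score, and the names `key_frame_mask`, `cutoff`, and `num_key_frames` are assumptions made for this example; the summary does not specify them, so this is not the authors' implementation.

```python
import math


def key_frame_mask(frames, cutoff, num_key_frames):
    """Return a boolean mask marking frames selected as key frames.

    frames: list of equal-length feature vectors, one per frame (hypothetical input).
    cutoff: assumed kernel width for the local density estimate.
    num_key_frames: assumed number of frames to select for masking.
    """
    n = len(frames)
    dist = [[math.dist(frames[i], frames[j]) for j in range(n)] for i in range(n)]

    # Local density: Gaussian-weighted count of the other frames (assumed kernel).
    density = [
        sum(math.exp(-((dist[i][j] / cutoff) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]

    # Minimum distance to any frame with strictly higher density.
    # A frame with no denser frame falls back to its largest distance.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if density[j] > density[i]]
        delta.append(min(higher) if higher else max(dist[i]))

    # Assumed scoring rule: dense frames that are far from denser frames rank highest.
    score = [density[i] * delta[i] for i in range(n)]
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    selected = set(ranked[:num_key_frames])
    return [i in selected for i in range(n)]


# Toy input: two tight groups of frames in a 2-D feature space.
toy_frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
toy_mask = key_frame_mask(toy_frames, cutoff=1.0, num_key_frames=2)
```

On the toy input, the central frame of each group is the one selected, which matches the intuition that key frames are locally dense and well separated from denser frames.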

Abstract

Human motion generation is a cutting-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain. Firstly, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of the implicit memory leads to memory decay. Secondly, Mamba struggles with multimodal fusion compared to Transformers and lacks alignment with textual queries, often confusing directions (left or right) or omitting parts of longer text queries. To address these challenges, our paper presents three key contributions. Firstly, we introduce KMM, a novel architecture featuring Key Frame Masking Modeling, designed to enhance Mamba's focus on key actions in motion segments. This approach addresses the memory decay problem and represents a pioneering method for customizing strategic frame-level masking in SSMs. Additionally, we design a contrastive learning paradigm that addresses the multimodal fusion problem in Mamba and improves motion-text alignment. Finally, we conduct extensive experiments on the go-to dataset, BABEL, achieving state-of-the-art performance with a reduction of more than 57% in FID and 70% in parameters compared to previous state-of-the-art methods. See project website: https://steve-zeyu-zhang.github.io/KMM
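To make the contrastive text-motion alignment idea more concrete, the following is a minimal sketch of a generic symmetric InfoNCE (CLIP-style) loss over paired text and motion embeddings. The abstract does not give the exact objective, so the loss form, the `temperature` value, and the function names here are assumptions for illustration only, not the paper's formulation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max for numerical stability
        log_partition = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def contrastive_alignment_loss(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE loss; text_emb[i] and motion_emb[i] form a positive pair."""
    text = [l2_normalize(v) for v in text_emb]
    motion = [l2_normalize(v) for v in motion_emb]

    # Cosine similarity between every text and every motion, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, m)) / temperature for m in motion]
        for t in text
    ]
    logits_transposed = [list(col) for col in zip(*logits)]

    # Average the text-to-motion and motion-to-text directions.
    text_to_motion = diagonal_cross_entropy(logits)
    motion_to_text = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (text_to_motion + motion_to_text)


# Toy batch: each text embedding points roughly the same way as its paired motion.
toy_text = [[1.0, 0.0], [0.0, 1.0]]
toy_motion = [[1.0, 0.1], [0.1, 1.0]]
toy_loss = contrastive_alignment_loss(toy_text, toy_motion)
```

Minimizing a loss of this kind pulls each text embedding toward its paired motion and away from the other motions in the batch. In a trainable setup the text encoder would receive gradients from it, in contrast to a frozen text encoder.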

Paper Structure

This paper contains 25 sections, 14 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: The figure illustrates that previous extended motion generation methods often struggle with directional instructions, leading to incorrect motions. In contrast, our proposed KMM, with enhanced text-motion alignment, effectively improves the model's understanding of text queries, resulting in more accurate motion generation.
  • Figure 2: The figure demonstrates our novel method from three different perspectives: (a) illustrates the key frame masking strategy based on local density and minimum distance to higher density calculation. (b) showcases the overall architecture of the masked bidirectional Mamba. (c) demonstrates the text-to-motion alignment, highlighting the process before and after alignment.
  • Figure 3: The figure demonstrates a qualitative comparison between the previous state-of-the-art method in extended motion generation and our KMM. The qualitative results show that our method significantly outperforms others in handling complex text queries and generating more accurate corresponding motions.
  • Figure 4: The figure presents qualitative visualization results of KMM. The text prompts are sourced and combined from HumanML3D (guo2022generating) and BABEL (punnakkal2021babel). The number in brackets is the target length on which the generated motion is conditioned, showing that KMM can dynamically produce motion of the desired duration. The visualizations showcase KMM's superior performance in generating robust and diverse motions that align closely with lengthy and complex text queries.
  • Figure 5: The figure shows the user study interface where 50 participants evaluated motion sequences generated by TEACH, PriorMDM, FlowMDM, and KMM, focusing on text-motion alignment, robustness, diversity, and usability. The text prompts are randomly extracted and combined from the HumanML3D (guo2022generating) and BABEL (punnakkal2021babel) test sets.