KMM: Key Frame Mask Mamba for Extended Motion Generation

Zeyu Zhang; Hang Gao; Akide Liu; Qi Chen; Feng Chen; Yiran Wang; Danning Li; Rui Zhao; Zhenming Li; Zhongwen Zhou; Hao Tang; Bohan Zhuang

TL;DR

This work tackles extended motion generation by addressing Mamba's memory limitations and weak text–motion fusion. It introduces Key Frame Masking Modeling (KMM), which selects and masks salient key frames based on local density and distance to higher-density frames to better leverage Mamba's implicit memory. It also introduces a contrastive text–motion alignment objective that learns dynamic text representations, improving alignment beyond frozen CLIP-based approaches. Through extensive experiments on BABEL, BABEL-D, and HumanML3D, KMM achieves state-of-the-art results with substantial FID improvements and reduced compute, enabling more accurate, robust, and diverse long-motion generation for real-world applications.
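The key frame selection described above can be illustrated with a minimal sketch. It assumes a density-peaks style score (local density multiplied by the distance to the nearest higher-density frame); the Gaussian density kernel, the `cutoff` parameter, the product score, and the function names `select_key_frames` and `mask_frames` are illustrative assumptions, not details taken from the paper.

```python
import math


def select_key_frames(frames, num_key, cutoff):
    """Return sorted indices of the num_key frames with the highest score.

    frames: list of per-frame feature vectors (lists of floats).
    Score = local density * distance to nearest higher-density frame.
    The kernel and the product score are assumptions for illustration.
    """
    n = len(frames)
    # Pairwise Euclidean distances between frame features.
    dist = [[math.dist(frames[i], frames[j]) for j in range(n)] for i in range(n)]
    # Local density: Gaussian-weighted count of the other frames.
    rho = [
        sum(math.exp(-((dist[i][j] / cutoff) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    # Distance to the nearest frame with strictly higher density.
    # A frame with no denser neighbour gets its largest distance instead.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    score = [r * d for r, d in zip(rho, delta)]
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:num_key])


def mask_frames(frames, key_idx, mask_value=0.0):
    """Replace the selected key frames with a constant mask vector."""
    keys = set(key_idx)
    return [
        [mask_value] * len(f) if i in keys else list(f)
        for i, f in enumerate(frames)
    ]


# Toy sequence with two well-separated groups of frames.
toy = [[0.0], [1.0], [2.0], [50.0], [51.0], [52.0]]
key_idx = select_key_frames(toy, num_key=2, cutoff=5.0)
masked = mask_frames(toy, key_idx)
```

In this toy sequence the central frame of each group is dense and far from any denser frame, so it is the one picked for masking.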

Abstract

Human motion generation is a cutting-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain. First, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of its implicit memory leads to memory decay. Second, Mamba struggles with multimodal fusion compared to Transformers and lacks alignment with textual queries, often confusing directions (left or right) or omitting parts of longer text queries. To address these challenges, our paper presents three key contributions. First, we introduce KMM, a novel architecture featuring Key frame Masking Modeling, designed to enhance Mamba's focus on key actions in motion segments. This approach addresses the memory decay problem and represents a pioneering method for customizing strategic frame-level masking in SSMs. Second, we design a contrastive learning paradigm that addresses the multimodal fusion problem in Mamba and improves motion-text alignment. Finally, we conduct extensive experiments on the go-to dataset, BABEL, achieving state-of-the-art performance with a reduction of more than 57% in FID and 70% in parameters compared to previous state-of-the-art methods. See project website: https://steve-zeyu-zhang.github.io/KMM
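To make the contrastive motion-text alignment idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired text and motion embeddings. The abstract does not specify the loss, so this CLIP-style formulation, the temperature value of 0.07, and the function name `contrastive_alignment_loss` are all assumptions for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_alignment_loss(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (assumed formulation).

    text_emb[i] and motion_emb[i] are treated as a matching pair;
    every other combination in the batch acts as a negative.
    """
    t = [_normalize(v) for v in text_emb]
    m = [_normalize(v) for v in motion_emb]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(ti, mj)) / temperature for mj in m]
        for ti in t
    ]
    logits_t = [list(col) for col in zip(*logits)]
    text_to_motion = _diagonal_cross_entropy(logits)
    motion_to_text = _diagonal_cross_entropy(logits_t)
    return 0.5 * (text_to_motion + motion_to_text)


text = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment_loss(text, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_alignment_loss(text, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each text embedding is closest to its own motion embedding and large when the pairing is swapped, which is the signal that would let a text encoder be trained rather than kept frozen.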
