Motion Anything: Any to Motion Generation
Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley
TL;DR
Motion Anything addresses two core challenges in multimodal motion generation: prioritizing dynamic frames and body parts through an attention-guided masking scheme, and integrating multiple conditioning modalities (text and music) for coherent control. The framework combines a Temporal Adaptive Transformer and a Spatial Aligning Transformer to align motion temporally and spatially with multimodal inputs, guided by an adaptive Attention-based Mask Modeling approach. A new Text-Music-Dance (TMD) dataset with 2,153 samples is introduced to benchmark multimodal conditioning. Motion Anything reports state-of-the-art results or consistent gains on HumanML3D, KIT-ML, AIST++, and TMD, including a 15% FID improvement on HumanML3D. The paper also demonstrates practical utility through a 4D avatar generation pipeline with a Selective Rigging Mechanism, illustrating end-to-end applicability from multimodal prompts to animated avatars.
Abstract
Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Music-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. Project website: https://steve-zeyu-zhang.github.io/MotionAnything
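
To make the idea of attention-guided masking concrete, here is a minimal sketch, not the authors' implementation. It assumes a per-token attention score is already available (for example, from cross-attention between the condition and the motion tokens) and masks the highest-scoring tokens rather than a random subset. The function name `attention_guided_mask`, the `mask_ratio` parameter, and the example scores are all hypothetical; the abstract does not specify how the scores are computed or how many tokens are masked.

```python
import numpy as np

def attention_guided_mask(attn_scores, mask_ratio):
    """Hypothetical helper: mask the tokens with the highest attention scores.

    attn_scores: array of scores, e.g. shape (frames,) or (frames, joints).
    mask_ratio:  fraction of tokens to mask, in [0, 1].
    Returns a boolean array of the same shape (True = masked).
    """
    scores = np.asarray(attn_scores, dtype=float)
    flat = scores.ravel()
    # Number of tokens to mask, clamped to a valid range.
    k = max(0, min(flat.size, int(round(mask_ratio * flat.size))))
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        # Stable sort on negated scores: highest scores first, ties by index.
        top = np.argsort(-flat, kind="stable")[:k]
        mask[top] = True
    return mask.reshape(scores.shape)

# Illustrative (made-up) per-frame scores: frames 1 and 3 are the most "dynamic".
frame_scores = [0.05, 0.40, 0.10, 0.30, 0.15]
frame_mask = attention_guided_mask(frame_scores, mask_ratio=0.4)
print(frame_mask)
```

The same function applies to a frames-by-joints score matrix, in which case the mask selects individual (frame, body part) tokens; a masked model would then be trained to reconstruct exactly those high-attention tokens.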
