Motion Anything: Any to Motion Generation

Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley

TL;DR

Motion Anything addresses two core challenges in multimodal motion generation: prioritizing dynamic frames and actions through an attention-guided masking scheme, and integrating multiple conditioning modalities (text and music) for coherent control. The framework combines a Temporal Adaptive Transformer and a Spatial Aligning Transformer to align motion temporally and spatially with multimodal inputs, guided by an adaptive Attention-based Mask Modeling approach. A new Text-Music-Dance (TMD) dataset with 2,153 samples is introduced to benchmark multimodal conditioning, and Motion Anything achieves state-of-the-art results or strong improvements across HumanML3D, KIT-ML, AIST++, and TMD, including a 15% FID improvement on HumanML3D. The paper also demonstrates practical utility through a 4D avatar generation pipeline with a Selective Rigging Mechanism, illustrating end-to-end applicability from multimodal prompts to animated avatars. Overall, Motion Anything provides a versatile, precise framework for multimodal motion generation with strong empirical validation and new benchmark resources.

Abstract

Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Music-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website https://steve-zeyu-zhang.github.io/MotionAnything
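
The contrast the abstract draws between random masking and condition-aware masking can be illustrated with a minimal sketch. It assumes a per-frame attention score is already available. The function names, the scores, and the 50% mask ratio are hypothetical, chosen for illustration only, and are not taken from the paper, whose actual scoring and masking procedure may differ.

```python
import random


def random_mask(num_frames, ratio, seed=0):
    """Baseline: mask a uniformly random subset of frames."""
    k = round(num_frames * ratio)
    chosen = set(random.Random(seed).sample(range(num_frames), k))
    return [i in chosen for i in range(num_frames)]


def attention_mask(scores, ratio):
    """Mask the frames with the highest attention scores.

    `scores` is a hypothetical per-frame relevance score, e.g. how strongly
    each frame attends to the conditioning signal (text or music).
    """
    k = round(len(scores) * ratio)
    # Rank frame indices from highest to lowest score and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


# Illustrative scores for six frames; frames 2, 3 and 5 are the "dynamic" ones.
scores = [0.05, 0.10, 0.90, 0.80, 0.15, 0.70]
attn = attention_mask(scores, ratio=0.5)
rand = random_mask(len(scores), ratio=0.5)
print(attn)  # [False, False, True, True, False, True]
```

With the same mask budget, the random baseline may hide mostly static frames, while the attention-guided variant always hides the highest-scoring ones, so the model is trained to reconstruct the parts of the motion most tied to the condition.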

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures, 11 tables, 4 algorithms.

Figures (10)

  • Figure 1: User study form. The User Interface (UI) used in our user study.
  • Figure 2: Masking strategy comparison. This figure demonstrates the key differences between the previous random masking strategy [guo2024momask] (top) and our attention-based masking (bottom). Our masking strategy focuses on the more significant and dynamic parts of the motion (colored) corresponding to the condition.
  • Figure 2: Comparisons on FID and AIT. All tests are conducted on the same NVIDIA GeForce RTX 2080 Ti. The closer the model is to the origin, the better.
  • Figure 3: Motion Anything architecture. The multimodal architecture consists of several key components: (a) temporal and (c) spatial attention-based masking, (b) motion generator, and (d) a single block of motion generator. These components enable the model to learn key motions corresponding to the given conditions, and facilitate alignment between multimodal conditions and motion features.
  • Figure 3: 4D Avatar Generation. This approach enables 4D avatar generation conditioned on multimodal inputs, achievable with just a single text prompt.
  • ...and 5 more figures