MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

Yilin Wang, Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Xinxin Zuo, Juwei Lu, Hai Jiang, Li Cheng

TL;DR

MotionDreamer addresses data scarcity in motion synthesis by learning an explicit, discrete latent space of local motion patterns from a single reference. It tokenizes motion using a codebook of size $K$ and models local dependencies with a sliding-window local attention mechanism, SlidAttn, while training with a masked token objective and a differentiable dequantization via sparsemax. A codebook distribution regularization term $\mathcal{L}_{\text{token}}$ based on KL divergence promotes uniform usage of code entries, reducing codebook collapse. The approach yields state-of-the-art performance on faithfulness and diversity, supports downstream tasks such as temporal editing and beat-aligned dance, and scales to arbitrary sequence lengths.
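The KL-based codebook regularization can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `codebook_usage_kl`, the softmax-over-distances soft assignment, the `temperature` parameter, and the KL direction (usage against uniform) are all assumptions made for illustration. It shows only the general idea that average code usage is pushed toward a uniform distribution over the $K$ entries.

```python
import numpy as np

def codebook_usage_kl(features, codebook, temperature=1.0, eps=1e-12):
    """KL(average code usage || uniform) for soft code assignments.

    features: (N, D) encoder outputs; codebook: (K, D) code entries.
    The softmax soft assignment is an assumption of this sketch,
    not a detail taken from the paper.
    """
    # Squared Euclidean distance between every feature and every code: (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # each row sums to 1
    usage = probs.mean(axis=0)                    # average usage per code
    num_codes = codebook.shape[0]
    # KL(usage || uniform) = sum_k usage_k * (log usage_k + log K)
    return float(np.sum(usage * (np.log(usage + eps) + np.log(num_codes))))

codebook = np.eye(4)                              # 4 toy codes in 4-D
balanced = np.eye(4)                              # each code used once
collapsed = np.tile(codebook[0], (4, 1))          # every feature hits code 0
kl_balanced = codebook_usage_kl(balanced, codebook, temperature=0.01)
kl_collapsed = codebook_usage_kl(collapsed, codebook, temperature=0.01)
```

In this toy setup the penalty is close to zero when all codes are used equally and approaches $\log K$ when usage collapses onto a single entry, which is the behavior a collapse-reducing regularizer needs.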

Abstract

Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms state-of-the-art methods, which are typically GAN- or diffusion-based, in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, crowd animation, and beat-aligned dance generation, all using a single reference motion. Visit our project page: https://motiondreamer.github.io/
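To make the idea of sliding window local attention concrete, here is a minimal NumPy sketch. It is a sketch under stated assumptions, not the paper's SlidAttn layer: the function name `sliding_window_attention` is hypothetical, identity projections stand in for learned query, key, and value weights, relative positional embeddings are omitted, and overlapping window outputs are merged by simple averaging because the fusion rule is not specified here.

```python
import numpy as np

def sliding_window_attention(x, window, stride):
    """Single-head self-attention inside overlapping windows.

    x: (T, D) token features. Outputs of overlapping windows are
    averaged per position (an assumed fusion rule for this sketch).
    """
    assert 0 < stride <= window, "stride must not exceed window, or tokens are skipped"
    T, D = x.shape
    out = np.zeros((T, D), dtype=float)
    count = np.zeros((T, 1))
    starts = list(range(0, max(T - window, 0) + 1, stride))
    if starts[-1] + window < T:                  # cover the tail of the sequence
        starts.append(T - window)
    for s in starts:
        w = x[s:s + window]                      # tokens in this window
        scores = w @ w.T / np.sqrt(D)            # scaled dot-product scores
        scores -= scores.max(axis=1, keepdims=True)
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)  # softmax over the window only
        out[s:s + window] += attn @ w
        count[s:s + window] += 1
    return out / count                           # average where windows overlap

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 4))
local_out = sliding_window_attention(tokens, window=4, stride=2)
```

Because each token only attends within its windows, changing a distant token leaves the output at a given position untouched, and the cost grows linearly with sequence length for a fixed window size.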

Paper Structure

This paper contains 26 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of one-to-many motion synthesis. A single reference motion with an arbitrary skeleton can be used to generate natural and diverse novel motions while preserving the local motion patterns of the reference. Shown are diverse generations from MotionDreamer: a girl breakdancing (top) and a jaguar attacking (bottom).
  • Figure 2: (a) Overview of MotionDreamer, based on a localized generative masked transformer. The single reference motion ${\bm{m}}_{1:L}$ is embedded as motion tokens ${\bm{c}}$ by optimizing a codebook through vector quantization, where a codebook distribution regularization loss $\mathcal{L}_{\text{token}}$ is additionally introduced. The Local-M transformer learns the local dependencies of motion tokens through sliding window local attention (SlidAttn) layers. The SlidAttn layer attends to tokens within each unfolded overlapping window, based on learnable query and relative positional embeddings. Attention outputs are merged through overlap attention fusion (AttnFuse). (b) Visualization of the explicit distribution modeling for internal patterns. MotionDreamer learns to express and diversify the combination of internal patterns with an explicit categorical distribution over motion tokens, visualized as multiple token candidates predicted by Local-M given the previously generated ones.
  • Figure 3: Qualitative comparison on the "hiphop dance" sample from Mixamo. Patterns A and B refer to two difficult patterns present in the reference motion. Patterns that appear in the generated motions are framed and marked as either success or failure according to their quality.
  • Figure 4: Score distribution and average scores from the user study. Scores range from 1 to 5 and assess Coverage, Diversity, and Naturalness. The bars align with the right y-axis and show the percentage of votes each method received at each score level; the horizontal lines align with the left y-axis and show the average score of each method.
  • Figure 5: Ablation study on the codebook distribution regularization technique based on optimizing $\mathcal{L}_{\text{token}}$. Colors closer to green represent higher per-frame similarity, while colors closer to orange represent lower similarity.
  • ...and 3 more figures