Not All Frames Are Equal: Complexity-Aware Masked Motion Generation via Motion Spectral Descriptors

Pengfei Zhou, Xiangyue Zhang, Xukun Shen, Yong Hu

Abstract

Masked generative models have become a strong paradigm for text-to-motion synthesis, but they still treat motion frames too uniformly during masking, attention, and decoding. This is a poor match for motion, where local dynamic complexity varies sharply over time. We show that current masked motion generators degrade disproportionately on dynamically complex motions, and that frame-wise generation error is strongly correlated with motion dynamics. Motivated by this mismatch, we introduce the Motion Spectral Descriptor (MSD), a simple and parameter-free measure of local dynamic complexity computed from the short-time spectrum of motion velocity. Unlike learned difficulty predictors, MSD is deterministic, interpretable, and derived directly from the motion signal itself. We use MSD to make masked motion generation complexity-aware. In particular, MSD guides content-focused masking during training, provides a spectral similarity prior for self-attention, and can additionally modulate token-level sampling during iterative decoding. Built on top of masked motion generators, our method, DynMask, improves motion generation most clearly on dynamically complex motions while also yielding better overall FID on HumanML3D and KIT-ML. These results suggest that respecting local motion complexity is a useful design principle for masked motion generation. Project page: https://xiangyue-zhang.github.io/DynMask
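
As a concrete illustration of the descriptor, the following is a minimal sketch of a sliding-window DCT complexity signal, assuming a (T, D) sequence of motion token embeddings. The window length, the reduction of velocity to a scalar speed before the DCT, and the high-frequency energy ratio used for $\Omega(t)$ are simplifying assumptions for illustration, not the paper's exact definition.

```python
import numpy as np
from scipy.fft import dct

def motion_spectral_descriptor(tokens, win=8):
    """Per-frame spectral complexity from token-embedding velocity.

    tokens: (T, D) motion token embeddings.
    Returns phi (T, win), a DCT-magnitude descriptor per frame, and
    omega (T,), a scalar complexity summary per frame.
    """
    # Frame-to-frame velocity, padded so the output keeps length T.
    vel = np.diff(tokens, axis=0, prepend=tokens[:1])   # (T, D)
    # Simplifying assumption: collapse velocity to a scalar speed signal.
    speed = np.linalg.norm(vel, axis=1)                 # (T,)

    half = win // 2
    padded = np.pad(speed, (half, win - half - 1), mode="edge")
    phi = np.stack([
        # Sliding-window Type-II DCT magnitudes, one spectrum per frame.
        np.abs(dct(padded[t:t + win], type=2, norm="ortho"))
        for t in range(len(speed))
    ])
    # Illustrative scalar summary: fraction of non-DC spectral energy,
    # so frames with more high-frequency motion score higher.
    omega = phi[:, 1:].sum(axis=1) / (phi.sum(axis=1) + 1e-8)
    return phi, omega
```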

Paper Structure

This paper contains 19 sections, 22 equations, 5 figures, 8 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Motivation. Left: standard masked motion generation allocates uniform modeling budget across all frames, applying the same masking, attention, and sampling regardless of local motion difficulty; this leads to degraded quality on dynamically complex motion. Right: DynMask uses the Motion Spectral Descriptor (MSD), a DCT-based per-frame complexity signal, to make masked generation complexity-aware: more supervision, stronger attention exchange, and broader decoding exploration are directed toward dynamically difficult frames, yielding better motion quality.
  • Figure 2: Overview of DynMask. (a) Core and full framework. Given motion tokens from a VQ-VAE and a text condition from a frozen CLIP encoder, DynMask computes one motion-grounded complexity signal, the Motion Spectral Descriptor (MSD), and reuses it throughout masked generation. In the core model, MSD guides content-focused mask selection and motion-aware attention inside the masked transformer. In the full model, the same signal is further used by complexity-aware decoding at inference time. (b) MSD computation. For each frame, we compute token-embedding velocity, apply a sliding-window Type-II DCT, and obtain both a frame-level spectral descriptor $\boldsymbol{\phi}_t$ and a scalar complexity summary $\Omega(t)$. (c) Component details. Motion-aware attention blends learned attention logits with MSD-derived spectral similarity using a layer-decayed coefficient, while complexity-aware decoding assigns higher temperature and noise to dynamically harder frames (a sketch of these two steps follows this list).
  • Figure 3: MSD spectral fingerprints. Representative MSD heatmaps for different motion types show distinct time-frequency patterns across the sequence.
  • Figure 4: Qualitative comparison on challenging dynamic prompts. Compared with representative masked baselines, DynMask better preserves dynamic timing, side-specific limb actions, kick extension, running gait, jump takeoff, and airborne phases. Green circles highlight locally correct dynamic details, while red circles mark failure cases in the baselines.
  • Figure 5: Attention visualization on a multi-phase turning-and-spinning sequence. (a) The MoMask baseline produces comparatively diffuse attention and weak phase separation. (b) DynMask yields clearer phase-structured attention aligned with the underlying motion segments. (c) The MSD spectral similarity matrix shows strong within-phase consistency, explaining why motion-aware attention connects dynamically compatible regions more effectively.
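
The two mechanisms described in the Figure 2(c) caption are straightforward to prototype. Below is a minimal sketch, not the paper's implementation: it assumes cosine similarity between the per-frame descriptors $\boldsymbol{\phi}_t$ as the spectral prior, an exponential schedule for the layer-decayed coefficient, and a linear mapping from $\Omega(t)$ to per-frame temperature. The function names (`motion_aware_logits`, `complexity_aware_temperature`) and the hyperparameters (`lam0`, `decay`, `alpha`) are all illustrative.

```python
import numpy as np

def motion_aware_logits(attn_logits, phi, layer_idx, lam0=0.5, decay=0.8):
    """Blend learned attention logits with an MSD-derived spectral prior.

    attn_logits: (T, T) pre-softmax attention scores for one head.
    phi:         (T, K) per-frame MSD spectral descriptors.
    The prior's weight shrinks with depth (the layer-decayed coefficient),
    so shallow layers lean on the motion prior and deep layers on learning.
    """
    unit = phi / (np.linalg.norm(phi, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                      # (T, T) cosine similarity in [-1, 1]
    lam = lam0 * decay ** layer_idx          # assumed exponential layer decay
    return (1.0 - lam) * attn_logits + lam * sim

def complexity_aware_temperature(omega, tau_base=1.0, alpha=0.5):
    """Give dynamically harder frames a higher sampling temperature.

    omega: (T,) scalar MSD complexity per frame. The linear schedule and
    the centering on the sequence mean are illustrative choices.
    """
    tau = tau_base * (1.0 + alpha * (omega - omega.mean()))
    return np.clip(tau, 0.1, None)           # keep temperatures positive
```

Per-frame noise injection could plausibly follow the same schedule, scaling an additive logit perturbation by the same centered $\Omega(t)$ signal so that exploration concentrates on dynamically complex frames.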