Learning Context-Adaptive Motion Priors for Masked Motion Diffusion Models with Efficient Kinematic Attention Aggregation

Junkun Jiang, Jie Chen, Ho Yin Au, Jingyu Xiang

TL;DR

This paper introduces the Masked Motion Diffusion Model (MMDM), a diffusion-based generative reconstruction framework that enhances incomplete or low-confidence motion data using partially available high-quality reconstructions within a Masked Autoencoder architecture. It focuses on learning context-adaptive motion priors: specialized structural and temporal features extracted by the same reusable architecture.

Abstract

Vision-based motion capture solutions often struggle with occlusions, which result in the loss of critical joint information and hinder accurate 3D motion reconstruction. Wearable alternatives also suffer from noisy or unstable data, often requiring extensive manual cleaning and correction to achieve reliable results. To address these challenges, we introduce the Masked Motion Diffusion Model (MMDM), a diffusion-based generative reconstruction framework that enhances incomplete or low-confidence motion data using partially available high-quality reconstructions within a Masked Autoencoder architecture. Central to our design is the Kinematic Attention Aggregation (KAA) mechanism, which enables efficient, deep, and iterative encoding of both joint-level and pose-level features, capturing structural and temporal motion patterns essential for task-specific reconstruction. We focus on learning context-adaptive motion priors: specialized structural and temporal features extracted by the same reusable architecture, where each learned prior emphasizes different aspects of motion dynamics and is particularly well suited to its corresponding task. This enables the architecture to adaptively specialize without altering its structure. Such versatility allows MMDM to efficiently learn motion priors tailored to scenarios such as motion refinement, completion, and in-betweening. Extensive evaluations on public benchmarks demonstrate that MMDM achieves strong performance across diverse masking strategies and task settings. The source code is available at https://github.com/jjkislele/MMDM.
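
To make the idea of task-dependent masking concrete, the following is a minimal sketch of how refinement, completion, and in-betweening could each be posed as a boolean mask over a (frames, joints) grid, where `True` marks a joint to be regenerated. The function names, the confidence threshold of 0.5, and the exact mask layouts are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np

# All names below are hypothetical; True marks a joint the model would regenerate.

def refinement_mask(confidence, threshold=0.5):
    """Mask joints whose per-joint confidence falls below an assumed threshold."""
    return confidence < threshold

def completion_mask(num_frames, num_joints, missing_joints):
    """Mask the same set of missing (e.g. occluded) joints in every frame."""
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    mask[:, list(missing_joints)] = True
    return mask

def inbetweening_mask(num_frames, num_joints, keyframes):
    """Mask every joint in every frame except the given key poses."""
    mask = np.ones((num_frames, num_joints), dtype=bool)
    mask[list(keyframes), :] = False
    return mask
```

Under these assumptions, the three tasks differ only in which entries of the mask are set, which is consistent with the abstract's claim that one architecture can serve several tasks without structural changes.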

Paper Structure (50 sections, 7 equations, 7 figures, 12 tables, 1 algorithm)

Figures (7)

  • Figure 1: Architecture comparison of the proposed Masked Motion Diffusion Model (MMDM) against other methods. (a) Masked Autoencoders (MAEs) [jiang2022dual, yan2023skeletonmae, stoffl2024elucidating] reconstruct masked (low-confidence) joints from unmasked (visible/high-confidence) joints, but they are not designed for noisy input. (b) Motion diffusion models [gong2023diffpose, kapon2024mas] denoise pose sequences to generate high-quality motions, which typically require complete input tokens. (c) Our MMDM combines both paradigms, taking partial, noisy inputs and fusing joint- and pose-level representations via the proposed Kinematic Attention Aggregation (KAA) to output complete, high-quality motions.
  • Figure 1: Demonstration of the reverse diffusion process at time step $k$. Green and blue skeletons denote the ground truth and the prediction, respectively. The masked joints are first sampled from a normal distribution and then iteratively denoised.
  • Figure 2: Illustration of the reverse diffusion process in the proposed Masked Motion Diffusion Model (MMDM). It begins at iteration $k=K$ and proceeds sequentially to $k=0$, reconstructing the masked motion sequence $\mathbf{d}^{m}_k$ from Gaussian noise into high-quality data, conditioned on the unmasked motion sequence $\mathbf{d}^{\overline{m}}$. Specifically, at each iteration $k$, (a) the Kinematic Encoder encodes the unmasked joints into latent tokens $\mathbf{h}^{\overline{m}}$ and yields the kinematic condition $c$, and (b) the Motion Decoder decodes the concatenated tokens $[\mathbf{h}^{\overline{m}}; \mathbf{z}_{k}^{m}]$ conditioned on $c$. To preserve motion context, the unmasked output $\mathbf{d}^{\overline{m}}_{k}$ is replaced by the input $\mathbf{d}^{\overline{m}}$ at every step. Positional encodings of joint indices, frame numbers, and diffusion step indices are incorporated into the hidden features before each encoding and decoding stage.
  • Figure 3: Qualitative comparisons of motion capture performance. Both 2D projections and 3D renderings from novel viewpoints are shown. For other methods, green and blue skeletons represent the ground truth and motion capture results, respectively. For our approach, applied in the motion completion setting, green, blue, and red skeletons denote the ground truth, motion capture results, and completion results, respectively. Red dotted boxes highlight failure cases, indicating that our method yields more accurate results.
  • Figure 4: Qualitative comparisons for the motion in-betweening task. Motion sequences are sampled into key poses at a fixed ratio. Grey segments show the preceding and succeeding parts, while the transition is color-coded from yellow to purple in a rainbow gradient to indicate chronological order. We emphasize the joint trajectories of the pelvis, elbows, shoulders, knees, and four end-effectors. Our model generates trajectories that are closest to the ground truth, whereas other methods suffer from issues such as over-smoothing and jitter.
  • ...and 2 more figures
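
The reverse process described in the Figure 2 caption can be sketched roughly as follows: masked joints start as Gaussian noise, a denoiser is applied from $k=K$ down to $k=1$, and after every step the unmasked joints are overwritten with the original input. This is a simplified illustration under stated assumptions, not the authors' implementation. In particular, `denoise_step` is a hypothetical callable standing in for the Kinematic Encoder and Motion Decoder, and the sketch operates directly on joint coordinates rather than on latent tokens.

```python
import numpy as np

def reverse_diffusion(d_unmasked, mask, denoise_step, num_steps, rng):
    """Masked reverse diffusion with replacement of the known joints.

    d_unmasked:   (frames, joints, 3) array, trusted where mask is False
    mask:         (frames, joints) bool array, True where a joint is masked
    denoise_step: hypothetical callable (d_k, k) -> d_{k-1}; a stand-in
                  for the learned encoder/decoder pair
    """
    m = mask[..., None]  # broadcast the mask over the coordinate axis
    # Masked joints begin as Gaussian noise; unmasked joints keep the input.
    d = np.where(m, rng.standard_normal(d_unmasked.shape), d_unmasked)
    for k in range(num_steps, 0, -1):
        d = denoise_step(d, k)
        # Overwrite the unmasked output with the input to preserve context.
        d = np.where(m, d, d_unmasked)
    return d

# Toy usage: a placeholder "denoiser" that simply halves its input.
rng = np.random.default_rng(0)
motion = rng.standard_normal((4, 5, 3))
mask = np.zeros((4, 5), dtype=bool)
mask[:, 2] = True  # pretend joint 2 is missing in every frame
result = reverse_diffusion(motion, mask, lambda d, k: 0.5 * d, 10, rng)
```

The replacement step inside the loop guarantees that the trusted joints pass through unchanged regardless of what the denoiser predicts for them, so only the masked entries are actually generated.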