A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis

Esteve Valls Mascaro; Hyemin Ahn; Dongheui Lee

TL;DR

This work addresses the fragmentation of human motion synthesis tasks by proposing UNIMASK-M, a task‑agnostic model that treats forecasting, inbetweening, and reconstruction as a single masked reconstruction problem on a motion sequence $\mathbf{X}$ with mask $\mathbf{M}$. It introduces Pose Decomposition to partition a pose into limb‑based patches and employs a ViT‑based encoder/decoder with mixed embeddings to exploit spatio‑temporal relations and masking information. Key contributions include the patchified skeleton approach (PD), the mixed embeddings strategy ($emb_{mix}$, $emb_{pos}$, $emb_{kin}$, $emb_{mask}$), and a Pose Aggregation module (PA) that fuses partial observations into coherent full poses, achieving state‑of‑the‑art results on motion inbetweening (LaFAN1) and competitive forecasting (Human3.6M) while robustly handling occlusions. The method is efficient for real‑time synthesis and demonstrates strong cross‑task robustness, indicating a promising direction for unified motion synthesis research.

Abstract

The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly in long transition periods. More information can be found on the project website https://evm7.github.io/UNIMASKM-page/
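The reformulation of forecasting, inbetweening, and occlusion recovery as one reconstruction problem with different masking patterns can be illustrated with a small sketch. The function names, the (frames × joints) layout, and the convention that 1 marks a masked entry are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Iterable, List, Tuple

# Assumed convention (not stated in the source): a mask has shape (T, J),
# where 1 = masked (to be synthesized) and 0 = observed.
Mask = List[List[int]]


def forecasting_mask(num_frames: int, num_joints: int, num_observed: int) -> Mask:
    """Observe the first `num_observed` frames; mask every later frame."""
    return [[0 if t < num_observed else 1] * num_joints for t in range(num_frames)]


def inbetweening_mask(num_frames: int, num_joints: int, keyframes: Iterable[int]) -> Mask:
    """Observe only the key-pose frames; mask all transition frames."""
    known = set(keyframes)
    return [[0 if t in known else 1] * num_joints for t in range(num_frames)]


def occlusion_mask(num_frames: int, num_joints: int,
                   occluded: Iterable[Tuple[int, int]]) -> Mask:
    """Mask individual (frame, joint) pairs, e.g. joints hidden from a sensor."""
    hidden = set(occluded)
    return [[1 if (t, j) in hidden else 0 for j in range(num_joints)]
            for t in range(num_frames)]


if __name__ == "__main__":
    # The same hypothetical model input (motion, mask) would cover all three tasks.
    for name, mask in [
        ("forecasting", forecasting_mask(4, 2, num_observed=2)),
        ("inbetweening", inbetweening_mask(4, 2, keyframes=[0, 3])),
        ("occlusion", occlusion_mask(2, 2, occluded=[(0, 1)])),
    ]:
        print(name, mask)
```

Under this sketch, only the mask changes between tasks, while the motion tensor and the model interface stay the same, which is the property the unified formulation relies on.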

Paper Structure (9 sections, 1 equation, 9 figures, 5 tables)

Introduction
Related Work
Human Motion Synthesis
Masked Autoencoders
Methodology
Experimental results
Quantitative evaluation
Ablation Study
Conclusion

Figures (9)

Figure 1: Unified architecture for different human motion synthesis tasks. Green and purple skeletons denote known skeleton joints, while light red and green represent our model's predictions for masked joints.
Figure 2: UNIMASK-M architecture. Consider a human motion $\mathbf{X}$ and its binary mask $\mathbf{M}$. We first interpolate $\mathbf{X}_g$ to obtain $\mathbf{X}_{fill}$, which gives the model a consistent input. Then, our Pose Decomposition module (PD) deconstructs each pose $\mathbf{p}_t$ into a sequence of patches $\hat{\mathbf{p}}_t$, which we project and flatten into a sequence of tokens $\mathbf{E}$. We add $emb_{mix}$ to $\mathbf{E}$ to inform the encoder and decoder about the masked tokens and the spatio-temporal structure. Our ViT-based encoder and decoder reconstruct the sequence of tokens. Our Pose Aggregation module (PA) regroups the decoded tokens into poses using an MLP layer. Finally, each pose is projected back to the joint representation and added to our reference motion $\mathbf{X}_{ref}$.
Figure 3: Mixed embedding strategy. The mixed embeddings are obtained by summing (i) a masking token that identifies the masked patches ($emb_{mask}$); (ii) $L=5$ learnable spatial parameters, one per body part ($emb_{kin}$); and (iii) a sinusoidal position embedding ($emb_{pos}$).
Figure 4: Adaptation of UNIMASK-M using DCT and TempMLP. First, we apply the Discrete Cosine Transform (DCT) and the Inverse DCT (IDCT) to encode and decode the given motion. Additionally, we adopt a Temporal MLP (TempMLP) module to refine the predicted pose sequence through $M$ blocks of fully connected layers (FC), Layer Normalization (LN) and a residual connection.
Figure 5: Comparison of the motion forecasting task. Predicted skeletons are shown in red and blue.
...and 4 more figures
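
The mixed embedding described in the Figure 3 caption can be sketched as a sum of three terms per token. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the position embedding is assumed to be indexed by frame, the masking token is assumed to be added only to masked patches, and the learnable parameters are stood in for by small random values.

```python
import math
import random
from typing import List

NUM_PARTS = 5  # L = 5 body-part patches per pose, as stated in the Figure 3 caption


def sinusoidal_embedding(position: int, dim: int) -> List[float]:
    """Standard sinusoidal encoding: sin on even channels, cos on odd channels."""
    emb = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def mixed_embedding(frame: int, part: int, masked: bool,
                    emb_kin: List[List[float]], emb_mask: List[float]) -> List[float]:
    """emb_mix = emb_pos (frame) + emb_kin (body part) + emb_mask (masked patches only)."""
    dim = len(emb_mask)
    emb_pos = sinusoidal_embedding(frame, dim)  # assumption: indexed by frame
    return [
        emb_pos[d] + emb_kin[part][d] + (emb_mask[d] if masked else 0.0)
        for d in range(dim)
    ]


if __name__ == "__main__":
    dim = 8
    rng = random.Random(0)
    # Placeholders for parameters that would be learned during training.
    emb_kin = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(NUM_PARTS)]
    emb_mask = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    token_bias = mixed_embedding(frame=3, part=2, masked=True,
                                 emb_kin=emb_kin, emb_mask=emb_mask)
    print(len(token_bias))
```

In this sketch, two tokens from the same frame and body part differ only by the masking term, which is one simple way to tell the encoder and decoder which patches are masked.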
