MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov

TL;DR

MaskControl tackles the challenge of precise, high-quality text-to-motion generation with flexible joint control by introducing controllability to generative masked motion models. It combines a training-time Logits Regularizer with inference-time Logits Optimization and Differentiable Expectation Sampling (DES) to steer motion-token distributions toward target joint positions while preserving generation quality. The approach supports any-joint-any-frame control, body-part timeline control, and zero-shot objective control, and it demonstrates substantial improvements over state-of-the-art baselines on HumanML3D in both objective metrics (FID and average error) and qualitative realism. This work enables versatile, fast, and zero-shot adaptable motion synthesis suitable for animation, VR/AR, and robotics applications.

Abstract

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, the Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logits Optimization explicitly optimizes the predicted logits at inference time, directly reshaping the token distribution so that the generated motion accurately aligns with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to overcome the non-differentiable distribution sampling process encountered by both the Logits Regularizer and Logits Optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/
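To make the idea of optimizing logits through a differentiable expectation more concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation: the tiny 2-D codebook, the squared-error loss, the hand-derived gradient, and every function name (`expected_embedding`, `control_loss`, `optimize_logits`) are assumptions made for illustration. In particular, the expected embedding is compared directly with the target, standing in for the decoded joint position; the actual model would pass embeddings through a motion decoder. The sketch replaces discrete token sampling with the probability-weighted average of codebook entries, which keeps the loss differentiable with respect to the logits, and then runs gradient descent on the logits.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_embedding(logits, codebook):
    # Expectation over the codebook: sum_k p_k * e_k.
    # Unlike argmax or categorical sampling, this is differentiable in the logits.
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * code[d] for p, code in zip(probs, codebook)) for d in range(dim)]

def control_loss(logits, codebook, target):
    # Toy control objective (assumption): squared distance between the
    # expected embedding and the target position.
    emb = expected_embedding(logits, codebook)
    return sum((e - t) ** 2 for e, t in zip(emb, target))

def optimize_logits(logits, codebook, target, lr=0.5, steps=300):
    # Plain gradient descent on the logits, with the gradient derived by hand:
    # dL/dz_j = p_j * (g_j - sum_k p_k * g_k), where g_k = dL/dp_k.
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        emb = expected_embedding(logits, codebook)
        residual = [2.0 * (e - t) for e, t in zip(emb, target)]
        g = [sum(r * c for r, c in zip(residual, code)) for code in codebook]
        mean_g = sum(p * gk for p, gk in zip(probs, g))
        grad = [p * (gk - mean_g) for p, gk in zip(probs, g)]
        logits = [z - lr * dz for z, dz in zip(logits, grad)]
    return logits

# Hypothetical 3-entry codebook of 2-D embeddings and a target position.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [0.6, 0.2]
initial_logits = [0.0, 0.0, 0.0]

optimized = optimize_logits(initial_logits, codebook, target)
print("loss before:", control_loss(initial_logits, codebook, target))
print("loss after: ", control_loss(optimized, codebook, target))
```

In this toy setting the optimized logits shift probability mass toward the codebook entry nearest the target, which is the "reshaping the token distribution" behavior the abstract describes, shown here at a much smaller scale.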

Paper Structure

This paper contains 31 sections, 11 equations, 12 figures, 12 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overall architecture of MaskControl. (a) The Motion Tokenizer transforms the motion sequence into discrete motion tokens. (b) Differentiable Expectation Sampling (DES) samples from logits in a differentiable way, enabling differentiable conversion between discrete tokens in codebook space and transformer token space. (c) Training: the Logits Regularizer ensures high-quality motion by generating embeddings that closely align with the joint control signals during the unmasking process. (d) Inference: Logits Optimization guides the logits during the unmasking process at inference time based on the objective function.
  • Figure 2: Visualization comparisons to state-of-the-art methods for any-joint any-frame control. The plots on the top display the top view of pelvis control (root trajectory), while the bottom plot shows the side view of the right wrist. Red represents the control signal, and Blue represents the generated joint motion.
  • Figure 3: Visualization comparisons to state-of-the-art methods for zero-shot objective control. Objective: constrain a human to walk inside a square area.
  • Figure 4: Generating body parts timeline for STMC setting.
  • Figure 5: Comparison of FID score, spatial control error, and motion generation speed (circle size) for our accurate and fast models compared with state-of-the-art models. The closer the point is to the origin and the smaller the circle, the better the performance.
  • ...and 7 more figures