Less is More: Improving Motion Diffusion Models with Sparse Keyframes

Jinseok Bae, Inwoo Hwang, Young Yoon Lee, Ziyu Guo, Joseph Liu, Yizhak Ben-Shabat, Young Min Kim, Mubbasir Kapadia

TL;DR

This work tackles the high computational burden and limited controllability of dense-frame motion diffusion models by introducing the Sparse Motion Diffusion Model (sMDM), a keyframe-centric diffusion framework that masks non-keyframes and reconstructs dense frames via feature-space interpolation. It combines Visvalingam-Whyatt keyframe selection, Lipschitz-regularized input/output mappings, and a dynamic inference mask that emphasizes informative frames in later diffusion steps. With $N$ frames and $K$ keyframes, this reduces self-attention complexity from $O(N^2)$ to about $O(K^2)$. Empirically, sMDM yields stronger text alignment and motion realism than baselines across text-to-motion, long-sequence generation, and autoregressive control tasks, while maintaining high quality at fewer diffusion steps. The approach is also robust as a generative prior and generalizes across architectures and downstream tasks, potentially aligning diffusion-based motion synthesis more closely with professional animation workflows.
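To make the keyframe-selection step concrete, here is a minimal sketch of Visvalingam-Whyatt simplification applied to a motion trajectory. It is an illustration under stated assumptions, not the authors' implementation: the names `select_keyframes`, `triangle_area`, and `time_scale` are hypothetical, and prepending the frame index as an extra coordinate (so that a reversal along a straight line still counts as a bend) is an assumption of this sketch.

```python
import numpy as np

def triangle_area(a, b, c):
    """Area of the triangle spanned by three points in any dimension."""
    ab, ac = b - a, c - a
    # Gram-determinant form of the squared parallelogram area.
    gram = np.dot(ab, ab) * np.dot(ac, ac) - np.dot(ab, ac) ** 2
    return 0.5 * np.sqrt(max(gram, 0.0))

def select_keyframes(motion, num_keyframes, time_scale=1.0):
    """Visvalingam-Whyatt style simplification of an (N, D) trajectory.

    Repeatedly drops the interior frame whose triangle with its current
    neighbours has the smallest area, until num_keyframes frames remain.
    Returns a boolean mask of length N (True = keyframe).
    """
    motion = np.asarray(motion, dtype=float)
    n = len(motion)
    if num_keyframes < 2:
        raise ValueError("need at least the first and last frame")
    # Assumption of this sketch: time is treated as one more coordinate.
    t = time_scale * np.arange(n, dtype=float)[:, None]
    pts = np.hstack([t, motion])
    keep = list(range(n))
    while len(keep) > num_keyframes:
        areas = [
            triangle_area(pts[keep[i - 1]], pts[keep[i]], pts[keep[i + 1]])
            for i in range(1, len(keep) - 1)
        ]
        # Offset by one because the first frame is never a candidate.
        del keep[int(np.argmin(areas)) + 1]
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    return mask

# A 1-D "motion" that rises for five frames and then falls back.
values = np.concatenate([np.arange(6), np.arange(4, -1, -1)])[:, None]
mask = select_keyframes(values, num_keyframes=3)
print(np.flatnonzero(mask))  # frames 0, 5 and 10 survive
```

This naive loop recomputes every area after each removal, which is quadratic in the number of frames. A priority queue would be the usual choice for long sequences, but the simple version keeps the selection rule easy to read.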

Abstract

Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks.
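The masking and interpolation steps described above can be sketched in a few lines of NumPy. This is a hedged illustration only: `interpolate_dense` is a hypothetical helper, the keyframe features are toy values rather than transformer outputs, and the printed pair counts simply illustrate why attending over $K$ keyframes instead of $N$ frames is cheaper.

```python
import numpy as np

def interpolate_dense(key_feats, key_idx, num_frames):
    """Linearly interpolate (K, C) keyframe features to (num_frames, C).

    key_idx must be increasing; frames outside its range take the value
    of the nearest keyframe.
    """
    key_feats = np.asarray(key_feats, dtype=float)
    key_idx = np.asarray(key_idx)
    frames = np.arange(num_frames)
    # np.interp is 1-D, so interpolate each feature channel separately.
    cols = [
        np.interp(frames, key_idx, key_feats[:, c])
        for c in range(key_feats.shape[1])
    ]
    return np.stack(cols, axis=1)

num_frames = 9
mask = np.zeros(num_frames, dtype=bool)
mask[[0, 4, 8]] = True                 # toy keyframe mask
key_idx = np.flatnonzero(mask)

# Toy features for the three keyframes (two channels each).
key_feats = np.array([[0.0, 0.0], [4.0, 8.0], [0.0, 0.0]])

dense = interpolate_dense(key_feats, key_idx, num_frames)
print(dense.shape)                     # (9, 2)
print(dense[2])                        # [2. 4.], halfway between keyframes 0 and 4

# Pairwise attention entries: dense frames versus keyframes only.
print(num_frames ** 2, len(key_idx) ** 2)  # 81 9
```

Only the keyframe rows would pass through the attention layers in such a setup. The dense sequence is recovered afterwards, which is where the smoothness of the surrounding mappings starts to matter.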

Paper Structure

This paper contains 24 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: We propose a keyframe-centric framework for training motion diffusion models. Our method, the Sparse Motion Diffusion Model (sMDM), outperforms the baseline Motion Diffusion Model (MDM) [tevet2022human], achieving more stable and precise motion generation while better capturing text prompts.
  • Figure 2: Model architecture of the Sparse Motion Diffusion Model (sMDM). sMDM uses a binary keyframe mask $\mathbf{M}$ to exclude non-keyframes from the self-attention layers. During training, $\mathbf{M}$ is derived from the clean input $\mathbf{x}_0$ via keyframe selection [visvalingam1993line]. At inference, the model starts with a uniform keyframe mask at earlier timesteps ($t > T'$), then updates $\mathbf{M}$ by selecting keyframes from $\mathbf{x}_t$ at later timesteps ($t \leq T'$). Finally, sMDM reconstructs the dense motion by linearly interpolating the features of the selected keyframes. To ensure smooth interpolation, we replace the input and output linear layers with Lipschitz MLPs [liu2022learning]. Red boxes indicate the changes from the baseline MDM [tevet2022human].
  • Figure 3: Qualitative evaluation of text-to-motion generation results. A red cross or green circle marks each result, highlighting which parts of the text prompt are missing from the generated motions. Unlike our method, MDM [tevet2022human] frequently overlooks parts of the input text command. Similarly, although MotionGPT [jiang2023motiongpt] uses an advanced text encoder like sMDM-stella, it struggles to capture fine-grained styles (middle) and contextual details (bottom). In contrast, our models faithfully adhere to the input text commands.
  • Figure 4: Visualization of long-sequence motion generation using the Double Take strategy [shafir2023human]. Given four sequential text prompts, the pretrained model generates a continuous motion sequence conditioned on the prompts. For readability, each segment is shown in a separate frame. PriorMDM and sPriorMDM refer to models using the standard MDM and our sparse MDM (sMDM), respectively. Consistent with our findings in regular text-conditioned motion generation (Table \ref{tab:humanml3d}), our model effectively captures fine-grained motion details described in the prompts, whereas the baseline model frequently fails to fully adhere to the input instructions. Furthermore, our model generates natural transitions that align smoothly with the neighboring motion segments.
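
The inference-time mask schedule described in the Figure 2 caption (uniform keyframes while $t > T'$, keyframes selected from $\mathbf{x}_t$ once $t \leq T'$) can be sketched as follows. This is a rough illustration under assumptions: `uniform_mask`, `mask_for_step`, and `switch_step` (standing in for $T'$) are hypothetical names, and `largest_bend_frames` is a deliberately simple stand-in selector, not the Visvalingam-Whyatt procedure the paper cites.

```python
import numpy as np

def uniform_mask(num_frames, num_keyframes):
    """Boolean mask with keyframes spread evenly over the sequence."""
    idx = np.round(np.linspace(0, num_frames - 1, num_keyframes)).astype(int)
    mask = np.zeros(num_frames, dtype=bool)
    mask[idx] = True
    return mask

def largest_bend_frames(x_t, num_keyframes):
    """Hypothetical stand-in selector: endpoints plus sharpest bends."""
    n = len(x_t)
    bend = np.zeros(n)
    # Magnitude of the discrete second difference at interior frames.
    bend[1:-1] = np.linalg.norm(x_t[2:] - 2 * x_t[1:-1] + x_t[:-2], axis=1)
    bend[[0, -1]] = np.inf  # always keep the first and last frame
    idx = np.argsort(-bend, kind="stable")[:num_keyframes]
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

def mask_for_step(t, switch_step, x_t, num_keyframes, select_fn):
    """Uniform mask while t > switch_step, data-driven mask afterwards."""
    if t > switch_step:
        # Early, noisy steps: x_t carries little structure to select from.
        return uniform_mask(len(x_t), num_keyframes)
    # Later steps: pick keyframes from the current estimate x_t.
    return select_fn(x_t, num_keyframes)

# Toy sample: rises for three frames, then stays flat.
x_t = np.array([0, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3], dtype=float)[:, None]

early = mask_for_step(900, 500, x_t, 3, largest_bend_frames)
late = mask_for_step(100, 500, x_t, 3, largest_bend_frames)
print(np.flatnonzero(early))  # evenly spaced: frames 0, 5, 10
print(np.flatnonzero(late))   # data-driven: frames 0, 3, 10
```

Passing the selector as an argument keeps the schedule independent of the selection rule, so any keyframe selector with the same signature could be swapped in.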