Less is More: Improving Motion Diffusion Models with Sparse Keyframes
Jinseok Bae, Inwoo Hwang, Young Yoon Lee, Ziyu Guo, Joseph Liu, Yizhak Ben-Shabat, Young Min Kim, Mubbasir Kapadia
TL;DR
This work tackles the high computational burden and limited controllability of dense-frame motion diffusion models by introducing the Sparse Motion Diffusion Model (sMDM), a keyframe-centric diffusion framework that masks non-keyframes and reconstructs dense frames via feature-space interpolation. It combines Visvalingam-Whyatt keyframe selection, Lipschitz-regularized input/output mappings, and a dynamic inference mask that emphasizes informative frames in later diffusion steps, reducing self-attention complexity from $O(N^2)$ to roughly $O(K^2)$, where $N$ is the number of frames and $K$ the number of keyframes. Empirically, sMDM yields stronger text alignment and motion realism than baselines across text-to-motion, long-sequence generation, and autoregressive control tasks, while maintaining high quality with fewer diffusion steps. The approach also proves robust as a generative prior and generalizes across architectures and downstream tasks, potentially aligning diffusion-based motion synthesis more closely with professional animation workflows.
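To make the keyframe-selection step concrete, here is a minimal sketch of Visvalingam-Whyatt simplification applied to a toy trajectory. It is not the authors' implementation: the function names (`triangle_area`, `visvalingam_whyatt`), the use of the frame index as a time coordinate, and the naive quadratic-time loop are all assumptions made for illustration. The algorithm repeatedly discards the interior point whose triangle with its two current neighbors has the smallest area, until `k` points remain.

```python
import numpy as np


def triangle_area(p, q, r):
    """Area of the triangle p-q-r in any dimension (via the Gram determinant)."""
    a, b = q - p, r - p
    gram = np.dot(a, a) * np.dot(b, b) - np.dot(a, b) ** 2
    return 0.5 * np.sqrt(max(gram, 0.0))  # clamp tiny negative rounding errors


def visvalingam_whyatt(points, k):
    """Return the indices of k points kept by Visvalingam-Whyatt simplification.

    points: array of shape (N, D); the first and last points are always kept.
    Naive O(N^2) version for clarity (hypothetical helper, not from the paper).
    """
    if k < 2:
        raise ValueError("k must be at least 2 to keep both endpoints")
    keep = list(range(len(points)))
    while len(keep) > k:
        # Effective area of every interior point with respect to its neighbors.
        areas = [
            triangle_area(points[keep[i - 1]], points[keep[i]], points[keep[i + 1]])
            for i in range(1, len(keep) - 1)
        ]
        # Remove the least informative point (smallest effective area).
        del keep[int(np.argmin(areas)) + 1]
    return keep


# Toy example: a V-shaped 1-D signal, with the frame index as the time axis.
t = np.arange(9, dtype=float)
trajectory = np.stack([t, np.abs(t - 4.0)], axis=1)
keyframes = visvalingam_whyatt(trajectory, k=3)
print(keyframes)  # prints [0, 4, 8]
```

On this toy signal the two endpoints and the turning point survive, which is the sense in which the selected keyframes are geometrically meaningful. How real motion features are scaled against the time axis is not specified here and would affect which frames are kept.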
Abstract
Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. Processing dense animation frames makes training considerably more complex, even with modern neural architectures, especially when learning the intricate distributions of large motion datasets. This severely limits the performance of generative motion models on downstream tasks. Inspired by professional animators, who focus mainly on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating the missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while maintaining high performance with significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks.
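The masking-and-interpolation idea can be illustrated with a small sketch. The assumptions are significant: the paper interpolates in a learned feature space, whereas this example uses plain linear interpolation on raw per-frame features, and the helper names (`keyframe_mask`, `interpolate_dense`, `attention_pairs`) are hypothetical. The sketch only shows the mechanics of keeping $K$ of $N$ frames, filling in the rest, and the resulting drop in self-attention token pairs.

```python
import numpy as np


def keyframe_mask(n_frames, key_idx):
    """Boolean mask of length n_frames that is True only at keyframes."""
    mask = np.zeros(n_frames, dtype=bool)
    mask[np.asarray(key_idx, dtype=int)] = True
    return mask


def interpolate_dense(key_idx, key_feats, n_frames):
    """Rebuild all frames from keyframes by per-feature linear interpolation.

    key_idx must be sorted in increasing order; key_feats has shape (K, D).
    Linear interpolation is a stand-in assumption for the paper's
    feature-space interpolation.
    """
    key_idx = np.asarray(key_idx, dtype=float)
    t = np.arange(n_frames, dtype=float)
    columns = [
        np.interp(t, key_idx, key_feats[:, d]) for d in range(key_feats.shape[1])
    ]
    return np.stack(columns, axis=1)


def attention_pairs(n_tokens):
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens


# Toy motion: 9 frames, 2 features, piecewise linear between frames 0, 4, 8.
n_frames = 9
t = np.arange(n_frames, dtype=float)
motion = np.stack([np.abs(t - 4.0), 2.0 * t], axis=1)

key_idx = [0, 4, 8]                     # illustrative keyframe choice
mask = keyframe_mask(n_frames, key_idx)
sparse = motion[mask]                   # only K = 3 frames would be processed
dense = interpolate_dense(key_idx, sparse, n_frames)

print(np.allclose(dense, motion))       # prints True
print(attention_pairs(n_frames), attention_pairs(len(key_idx)))  # prints 81 9
```

Reconstruction is exact here only because the toy motion is piecewise linear between the chosen keyframes; for real motion the quality of the fill-in depends on which frames are selected and on the interpolation scheme.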
