
Flexible Motion In-betweening with Diffusion Models

Setareh Cohan, Guy Tevet, Daniele Reda, Xue Bin Peng, Michiel van de Panne

TL;DR

Problem: generating plausible human motion interpolations guided by sparse keyframes and text prompts. Approach: CondMDI, a unified diffusion-based framework that supports flexible keyframe conditioning via an observation mask and optional guidance. Contributions: random-keyframe training, masked conditional reverse diffusion, and comprehensive HumanML3D evaluations showing strong fidelity, diversity, and efficiency. Impact: enables practical, user-guided animation workflows and demonstrates the viability of diffusion models for flexible keyframe in-betweening.

Abstract

Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models for generating diverse human motions guided by keyframes. Unlike previous in-betweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI), which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore the use of guidance and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.

Paper Structure

This paper contains 36 sections, 11 equations, 6 figures, 6 tables, and 3 algorithms.

Figures (6)

  • Figure 1: Conditional Motion Diffusion In-betweening (CondMDI) overview. The model takes as input a noisy motion sequence ${\mathbf{x}}_t$, the diffusion step $t$, a text prompt ${\mathbf{p}}$, and a keyframe control signal ${\mathbf{c}}$. The text prompt ${\mathbf{p}}$ is first passed through a CLIP-based textual embedder [radford2021learning] and then into the motion diffusion model, which is based on GMD [karunratanakul2023guided]. The Mask Extractor module extracts the binary mask ${\mathbf{m}}$, and the Masked Sum module performs the masked addition $\tilde{{\mathbf{x}}}_t = {\mathbf{m}} \odot {\mathbf{c}} + (\mathbf{1} - {\mathbf{m}}) \odot {\mathbf{x}}_t$. The gray box around $\tilde{{\mathbf{x}}}_t$ and ${\mathbf{m}}$ indicates that the two are concatenated.
  • Figure 2: Our model is capable of generating high-quality motions for difficult movements such as a karate kick or a yoga sun salutation pose. See the video for the full motions.
  • Figure 3: A walking motion conditioned only on the root joint (left) and only on the right wrist (right).
  • Figure 4: Ablation results on a simple S-walking motion, with keyframes equally spaced $30$ frames apart. Imputation alone fails to follow the keyframes; Imputation with guidance follows them but suffers from jitter and inconsistencies. $C$ indicates the denoising step at which replacement stops. For a clearer view, please refer to the supplementary video.
  • Figure 5: Different motions generated with the same conditioning keyframes. After the last keyframe in blue in (a), the motions (displayed in different colors) show diverse and coherent behavior over time (from left to right). Please refer to the supplementary video for a dynamic version with more samples.
  • ...and 1 more figures
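
The masked addition and concatenation described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The array shapes, the choice of the feature axis for concatenation, the function names `masked_sum` and `build_model_input`, and the toy keyframe values are all assumptions made for illustration.

```python
import numpy as np


def masked_sum(x_t, c, m):
    """Compute x_tilde = m * c + (1 - m) * x_t element-wise.

    Where the mask is 1 the keyframe value c is used; where it is 0
    the noisy sample x_t is kept unchanged.
    """
    return m * c + (1.0 - m) * x_t


def build_model_input(x_t, c, m):
    """Concatenate the masked sum with the mask itself.

    Assumption: concatenation is along the last (feature) axis.
    """
    x_tilde = masked_sum(x_t, c, m)
    return np.concatenate([x_tilde, m], axis=-1)


# Toy example with hypothetical sizes: 8 frames, 4 features per frame.
rng = np.random.default_rng(0)
num_frames, num_features = 8, 4
x_t = rng.standard_normal((num_frames, num_features))  # noisy motion
c = np.zeros((num_frames, num_features))               # keyframe signal
m = np.zeros((num_frames, num_features))               # binary mask

# Full keyframe at frame 0 (all features observed).
c[0] = 1.0
m[0] = 1.0

# Partial keyframe at frame 7 (only feature 0 observed).
c[7, 0] = 2.0
m[7, 0] = 1.0

model_input = build_model_input(x_t, c, m)
```

In this sketch the partial keyframe at frame 7 constrains a single feature and leaves the rest to the noisy sample, which mirrors the partial keyframe constraints mentioned in the abstract, such as conditioning only on the root joint.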