Single Motion Diffusion

Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H. Bermano, Daniel Cohen-Or

TL;DR

We address motion synthesis from extremely limited data by learning from a single motion sequence with arbitrary topology. The proposed SinMDM uses a lightweight diffusion model with a QnA-based UNet to capture core motion motifs while restricting the receptive field, enabling accurate, diverse, long-range motion generation without re-training for downstream tasks. The approach delivers strong quantitative and qualitative results on the Mixamo and HumanML3D benchmarks, outperforms prior single-motion methods, and enables versatile inference-time applications such as in-betweening, expansion, harmonization, style transfer, and crowd animation. This demonstrates that diffusion models can operate effectively with minimal data in the motion domain, offering practical, scalable tools for animators handling exotic skeletons and varied topologies.

Abstract

Synthesizing realistic animations of humans, animals, and even imaginary creatures has long been a goal for artists and computer graphics professionals. Compared to the imaging domain, which is rich with large available datasets, the number of data instances for the motion domain is limited, particularly for the animation of animals and exotic creatures (e.g., dragons), which have unique skeletons and motion patterns. In this work, we present a Single Motion Diffusion Model, dubbed SinMDM, a model designed to learn the internal motifs of a single motion sequence with arbitrary topology and synthesize motions of arbitrary length that are faithful to them. We harness the power of diffusion models and present a denoising network explicitly designed for the task of learning from a single input motion. SinMDM is designed to be a lightweight architecture, which avoids overfitting by using a shallow network with local attention layers that narrow the receptive field and encourage motion diversity. SinMDM can be applied in various contexts, including spatial and temporal in-betweening, motion expansion, style transfer, and crowd animation. Our results show that SinMDM outperforms existing methods in both quality and time-space efficiency. Moreover, while current approaches require additional training for different applications, our work facilitates these applications at inference time. Our code and trained models are available at https://sinmdm.github.io/SinMDM-page.
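
The abstract's point about a shallow network narrowing the receptive field can be illustrated with a small calculation. The sketch below is not the SinMDM architecture: it assumes a plain stack of 1D temporal convolutions described by hypothetical (kernel size, stride, dilation) triples, and it ignores the upsampling path and the attention window. The function name `receptive_field`, the layer values, and the sequence length are all illustrative assumptions.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int, int]]) -> int:
    """Temporal receptive field, in frames, of stacked 1D conv layers.

    Each layer is a (kernel_size, stride, dilation) triple.
    """
    rf = 1    # a single output frame sees one input frame before any layer
    jump = 1  # distance in input frames between adjacent positions at this depth
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical shallow stack; NOT the configuration used in the paper.
shallow = [(3, 1, 1), (3, 2, 1), (3, 1, 1)]
num_frames = 120  # hypothetical length of the single training motion

rf = receptive_field(shallow)
print(f"receptive field: {rf} frames, {rf / num_frames:.1%} of the sequence")
```

Under these assumed values each output frame depends on only a short window of the input, so a single sequence supplies many overlapping local segments to learn from. Adding more layers or larger strides widens the window toward the whole sequence.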

Paper Structure

This paper contains 31 sections, 9 equations, 16 figures, and 5 tables.

Figures (16)

  • Figure 1: SinMDM learns the internal motion motifs from a single motion sequence with arbitrary topology and synthesizes motions that are faithful to the learned core motifs of the input sequence. Top: a girl exercising while walking. Bottom: a breakdancing dragon. Left to right: breakdance uprock, breakdance freeze, and breakdance flair.
  • Figure 2: Left: To allow training on a single motion, our denoising network is designed such that its overall receptive field covers only a portion of the input sequence. This effectively allows the network to simultaneously learn from multiple local temporal motion segments. Our denoiser predicts the input sequence from a noisy one. $x_t^0\dots x_t^N$ is a motion of $N$ frames at diffusion step $t$. Right: Our network is a shallow UNet, enhanced with a QnA local attention layer.
  • Figure 3: Motion composition. Parts from a reference motion $y$ are composed with the synthesized motion $\hat{x}_0$ according to a composition map.
  • Figure 4: Temporal composition -- In-betweening. Both top and bottom show results for the same input, introducing diverse outputs. The beginning and the end of the motion are given by the reference sequence and can be distinguished according to their faded tone. Observe that the beginning and the end are identical in both sequences. The center of each motion is synthesized.
  • Figure 5: Temporal composition -- motion expansion. Pairs of motions exhibit diverse synthesis from a single input. The motion part provided by the reference sequence is identifiable by its faded color. Note that the parts given as input are identical in both sequences, while the synthesized parts differ. Top: synthesize a suffix given a temporal prefix. Bottom: synthesize a prefix and a suffix, given the middle part.
  • ...and 11 more figures
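
The motion composition described in the Figure 3 caption, and its temporal use for in-betweening in Figure 4, can be sketched as an elementwise blend. This is a minimal illustration under stated assumptions, not the authors' implementation: motions are taken to be frames-by-features lists, the composition map is assumed to hold 1 where the reference $y$ is kept and 0 where the synthesized $\hat{x}_0$ is kept, and the names `compose` and `inbetweening_map` are hypothetical. The captions do not say at which point of sampling the blend is applied, so the sketch shows only the blend itself.

```python
from typing import List

Motion = List[List[float]]  # frames x features


def compose(reference: Motion, synthesized: Motion, comp_map: Motion) -> Motion:
    """Blend two motions elementwise: m * y + (1 - m) * x_hat.

    Assumed convention: a map value of 1 keeps the reference, 0 keeps the synthesis.
    """
    if not (len(reference) == len(synthesized) == len(comp_map)):
        raise ValueError("all inputs must have the same number of frames")
    return [
        [m * y + (1.0 - m) * x for y, x, m in zip(y_row, x_row, m_row)]
        for y_row, x_row, m_row in zip(reference, synthesized, comp_map)
    ]


def inbetweening_map(num_frames: int, num_features: int,
                     prefix: int, suffix: int) -> Motion:
    """Temporal map that keeps the first `prefix` and last `suffix` frames."""
    return [
        [1.0 if (f < prefix or f >= num_frames - suffix) else 0.0] * num_features
        for f in range(num_frames)
    ]


# Toy data: the reference is all ones and the synthesized motion is all zeros,
# so the output shows directly which source each frame came from.
y = [[1.0, 1.0] for _ in range(4)]
x_hat = [[0.0, 0.0] for _ in range(4)]
mask = inbetweening_map(num_frames=4, num_features=2, prefix=1, suffix=1)
print(compose(y, x_hat, mask))
```

A spatial composition would use the same `compose` call with a map that selects feature columns (for example, a subset of joints) instead of frame ranges.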