SMooDi: Stylized Motion Diffusion Model

Lei Zhong, Yiming Xie, Varun Jampani, Deqing Sun, Huaizu Jiang

TL;DR

SMooDi generates stylized human motion from a content text and a style reference motion by adapting a pre-trained motion diffusion model in latent space and injecting style through a dedicated adaptor and guidance. The style adaptor adds residuals conditioned on a style embedding, and sampling employs both classifier-free and classifier-based guidance, including a differentiable style distance $G(\mathbf{z}_t, t, \mathbf{s})$, to balance content fidelity against style realization. Training combines a standard denoising loss $L_{std}$ with a content-preservation loss $L_{pr}$ and a cycle-prior-preservation loss $L_{cyc}$ to mitigate content forgetting and encourage robust content-style translation; the model is trained on HumanML3D and 100STYLE. Results on stylized text-to-motion generation and motion style transfer show state-of-the-art or competitive performance in content preservation, style reflection, and realism without per-style fine-tuning, which makes SMooDi practical for flexible, scalable stylized motion generation and transfer.
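The interplay of the two guidance types can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the common classifier-free recipe of extrapolating between an unconditional, a content-only, and a content-plus-style noise prediction, and it assumes classifier-based guidance moves the latent against the gradient of the style distance $G$. The function names, the weights `w_content`, `w_style`, `tau`, and the toy quadratic distance are all hypothetical.

```python
import numpy as np

def combine_classifier_free(eps_uncond, eps_content, eps_both, w_content, w_style):
    """Blend three noise predictions (assumed form, not taken from the paper).

    eps_uncond:  prediction with neither content nor style condition
    eps_content: prediction with the content text only
    eps_both:    prediction with content text and style motion
    """
    content_direction = eps_content - eps_uncond
    style_direction = eps_both - eps_content
    return eps_uncond + w_content * content_direction + w_style * style_direction

def classifier_based_step(z_t, grad_style_distance, tau):
    """Move the latent against the gradient of a differentiable style distance G."""
    return z_t - tau * grad_style_distance(z_t)

# Toy stand-in for G: squared distance to a fixed "style feature" (hypothetical).
style_feature = np.ones(4)

def toy_distance(z):
    return float(np.sum((z - style_feature) ** 2))

def toy_distance_grad(z):
    return 2.0 * (z - style_feature)

eps_u, eps_c, eps_cs = np.zeros(4), np.ones(4), np.full(4, 3.0)
eps_guided = combine_classifier_free(eps_u, eps_c, eps_cs, w_content=2.0, w_style=0.5)

z_t = np.zeros(4)
z_refined = classifier_based_step(z_t, toy_distance_grad, tau=0.1)
```

With both weights set to 1 the blend reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, which is a quick sanity check on the extrapolation.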

Abstract

We introduce a novel Stylized Motion Diffusion model, dubbed SMooDi, to generate stylized motion driven by content texts and style motion sequences. Unlike existing methods that either generate motion of various content or transfer style from one sequence to another, SMooDi can rapidly generate motion across a broad range of content and diverse styles. To this end, we tailor a pre-trained text-to-motion model for stylization. Specifically, we propose style guidance to ensure that the generated motion closely matches the reference style, alongside a lightweight style adaptor that directs the motion towards the desired style while ensuring realism. Experiments across various applications demonstrate that our proposed framework outperforms existing methods in stylized motion generation.

Paper Structure

This paper contains 24 sections, 9 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: SMooDi can generate realistic, stylized human motions given a content text and a style motion sequence. It also accepts a motion sequence as content input. Darker color indicates later frames in the sequence. To better showcase the stylized motion generation, we show the style label for each style motion sequence. Note that these style labels are not used as model input and are shown here for visualization purposes only. (Best viewed in color.)
  • Figure 2: Overview of SMooDi. Our model generates stylized human motions from content text and a style motion sequence. At denoising step $t$, our model takes the content text $\mathbf{c}$, style motion $\mathbf{s}$, and noisy latent $\mathbf{z}_t$ as input and predicts the noise $\epsilon_t$, which is then used to obtain $\mathbf{z}_{t-1}$. This denoising step is repeated $T$ times to obtain the noise-free motion latent $\mathbf{z}_0$, which is fed into a motion decoder $D$ to produce the stylized motion.
  • Figure 3: Detailed illustration of our proposed style adaptor. The style adaptor is connected to the motion diffusion model via a zero linear layer. The output of the style adaptor from each Transformer encoder is added to the motion diffusion model to steer the predicted noise towards the target style.
  • Figure 4: Visual illustrations of the classifier-free and classifier-based style guidance. (a) and (b) respectively show the classifier-free content and style guidance; (c) displays the initial stylized motion resulting from the combination of (a) and (b); (d) illustrates the refined stylized motion modified by the classifier-based style guidance.
  • Figure 5: Qualitative comparisons of our approach and baseline methods on two stylized motion generation tasks.
  • ...and 7 more figures
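
The connection described in the Figure 3 caption, where the adaptor output is added to the base model through a zero linear layer, can be sketched as follows. This is an illustrative NumPy sketch only. It assumes "zero linear layer" means a linear layer whose weights and bias start at zero, so the adaptor initially leaves the pre-trained model's output unchanged. The class `ZeroLinear`, the function `block_with_adaptor`, and the feature size are hypothetical names and values.

```python
import numpy as np

class ZeroLinear:
    """Linear layer whose weight and bias are initialized to zero (assumption)."""

    def __init__(self, dim):
        self.weight = np.zeros((dim, dim))
        self.bias = np.zeros(dim)

    def __call__(self, x):
        return x @ self.weight.T + self.bias

def block_with_adaptor(base_features, adaptor_features, zero_linear):
    """Add the adaptor's output to the base model's features as a residual."""
    return base_features + zero_linear(adaptor_features)

rng = np.random.default_rng(0)
dim = 8  # hypothetical feature size
base_features = rng.normal(size=(5, dim))     # stand-in for a base encoder output
adaptor_features = rng.normal(size=(5, dim))  # stand-in for a style adaptor output

layer = ZeroLinear(dim)
out_at_init = block_with_adaptor(base_features, adaptor_features, layer)

# Once the layer has learned non-zero weights, the adaptor starts to steer the output.
layer.weight = np.eye(dim)
out_after_update = block_with_adaptor(base_features, adaptor_features, layer)
```

Under this assumption the residual is exactly zero before any training, so the stylized model starts from the behavior of the pre-trained text-to-motion model and departs from it only as the layer's weights are learned.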