SMooDi: Stylized Motion Diffusion Model
Lei Zhong, Yiming Xie, Varun Jampani, Deqing Sun, Huaizu Jiang
TL;DR
SMooDi generates stylized human motion from a content text and a style reference motion by adapting a pre-trained motion diffusion model in latent space and injecting style through a dedicated adaptor and guidance. The style adaptor adds residuals conditioned on a style embedding. Guidance combines classifier-free and classifier-based terms, the latter using a differentiable style distance $G(\mathbf{z}_t, t, \mathbf{s})$, to balance content fidelity against style realization. Training combines a standard denoising loss $L_{std}$ with a content-preservation loss $L_{pr}$ and a cycle-prior-preservation loss $L_{cyc}$ to mitigate content forgetting and encourage robust content-style translation, using the HumanML3D and 100STYLE datasets. On stylized text-to-motion generation and motion style transfer, SMooDi achieves state-of-the-art or competitive content preservation, style reflection, and realism without per-style fine-tuning, which makes it practical for flexible, scalable stylized motion generation and transfer.
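As a rough illustration of how classifier-free and classifier-based guidance can be combined in one denoising step, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the linear feature map `W`, the squared-distance form of $G$, and the scales `w` and `tau` are all assumptions made for this example. A real system would use a learned style feature extractor and automatic differentiation.

```python
import numpy as np


def classifier_free_noise(eps_uncond, eps_cond, w):
    """Blend unconditional and conditional noise predictions with scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)


def style_distance(z_t, style_feat, W):
    """Toy stand-in for G(z_t, t, s): squared distance in a linear feature space.

    The time step t is ignored here; W is a hypothetical feature map.
    """
    residual = W @ z_t - style_feat
    return float(residual @ residual)


def style_distance_grad(z_t, style_feat, W):
    """Analytic gradient of the toy distance with respect to z_t."""
    return 2.0 * W.T @ (W @ z_t - style_feat)


def guided_noise(eps_uncond, eps_cond, z_t, style_feat, W, w=7.5, tau=0.1):
    """Classifier-free blend plus a classifier-based style term.

    Adding tau * grad G to the predicted noise moves the implied clean
    sample in the direction that lowers the style distance.
    """
    eps = classifier_free_noise(eps_uncond, eps_cond, w)
    return eps + tau * style_distance_grad(z_t, style_feat, W)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_t = rng.normal(size=4)          # toy latent at step t
    style_feat = rng.normal(size=3)   # toy style embedding of the reference
    W = rng.normal(size=(3, 4))       # hypothetical linear feature map
    eps_u = rng.normal(size=4)        # unconditional noise prediction
    eps_c = rng.normal(size=4)        # conditional noise prediction
    print(guided_noise(eps_u, eps_c, z_t, style_feat, W))
```

In this sketch, `w` trades off adherence to the conditioning against the unconditional prediction, and `tau` controls how strongly the style distance pulls on each step. Setting `tau=0` recovers plain classifier-free guidance.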
Abstract
We introduce a novel Stylized Motion Diffusion model, dubbed SMooDi, to generate stylized motion driven by content texts and style motion sequences. Unlike existing methods that either generate motion of various content or transfer style from one sequence to another, SMooDi can rapidly generate motion across a broad range of content and diverse styles. To this end, we tailor a pre-trained text-to-motion model for stylization. Specifically, we propose style guidance to ensure that the generated motion closely matches the reference style, alongside a lightweight style adaptor that directs the motion towards the desired style while ensuring realism. Experiments across various applications demonstrate that our proposed framework outperforms existing methods in stylized motion generation.
