StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion
Ziyu Guo, Young Yoon Lee, Joseph Liu, Yizhak Ben-Shabat, Victor Zordan, Mubbasir Kapadia
TL;DR
StyleMotif is a stylized motion diffusion framework that unifies content-conditioned and multi-modal style-conditioned generation in a single-branch architecture. A style-content cross fusion mechanism, together with a style encoder aligned to a pre-trained multi-modal model, lets it transfer style from motion, text, image, audio, and video cues while preserving motion realism. The approach builds on a latent diffusion backbone (MLD) and a pre-trained style encoder, and uses cross normalization to fuse style features into content features, improving style expressiveness and efficiency. Experiments report improvements over prior methods on metrics such as SRA, FID, and MM Dist, and show emergent multi-modal stylization capabilities relevant to animation, gaming, and virtual reality. The work highlights the potential of single-branch, multi-modal conditioning for flexible, high-quality stylized motion synthesis and points to future opportunities in data collection and generalization.
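The summary above names cross normalization but does not give its formula. The following is a minimal sketch of one plausible reading, assuming the style features are standardized with the mean and variance of the content features and then added back to the content features. The function names `cross_normalize` and `cross_fuse`, the `weight` and `eps` parameters, and the use of flat feature vectors are all illustrative assumptions, not details taken from the paper.

```python
import math
from statistics import fmean


def cross_normalize(style, content, eps=1e-6):
    """Standardize style features using the CONTENT features' statistics.

    Assumption: this is one plausible form of cross normalization; the
    exact formulation is not stated in the text above.
    """
    mu = fmean(content)
    var = fmean([(c - mu) ** 2 for c in content])
    scale = math.sqrt(var + eps)
    return [(s - mu) / scale for s in style]


def cross_fuse(content, style, weight=1.0):
    """Add the cross-normalized style features to the content features.

    `weight` is a hypothetical scalar controlling style strength.
    """
    normalized = cross_normalize(style, content)
    return [c + weight * s for c, s in zip(content, normalized)]


if __name__ == "__main__":
    content_features = [0.2, -0.4, 1.1, 0.7]
    style_features = [1.5, 0.3, -0.8, 0.9]
    print(cross_fuse(content_features, style_features, weight=0.5))
```

Under this reading, the style signal is brought onto the scale of the content features before the addition, so a single branch can take it in without a separate control network. Setting `weight=0.0` returns the content features unchanged.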
Abstract
We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: https://stylemotif.github.io
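The abstract says the style encoder is aligned with a pre-trained multi-modal model but does not state the training objective. As a hedged illustration only, the sketch below assumes a one-directional contrastive (InfoNCE-style) loss that pulls each motion-style embedding toward its paired embedding from the frozen multi-modal model. The function names, the `temperature` value, and the toy two-dimensional embeddings are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_alignment_loss(style_embs, anchor_embs, temperature=0.07):
    """InfoNCE-style loss; row i of each list is assumed to be a matched pair.

    Assumption: `anchor_embs` come from a frozen pre-trained multi-modal
    model, and only the style encoder producing `style_embs` is trained.
    """
    styles = [l2_normalize(v) for v in style_embs]
    anchors = [l2_normalize(v) for v in anchor_embs]
    total = 0.0
    for i, s in enumerate(styles):
        # Cosine similarity of style i against every anchor, scaled by temperature.
        logits = [sum(x * y for x, y in zip(s, a)) / temperature for a in anchors]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(styles)


if __name__ == "__main__":
    style = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[0.9, 0.1], [0.1, 0.9]]
    mismatched = [[0.1, 0.9], [0.9, 0.1]]
    print(contrastive_alignment_loss(style, matched))
    print(contrastive_alignment_loss(style, mismatched))
```

Whatever the objective, the aim of such alignment is a shared embedding space. Once motion-style embeddings sit there, embeddings of text, images, audio, or video from the same multi-modal model can serve as style references, which is consistent with the multi-modal stylization the abstract describes.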
