StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion

Ziyu Guo, Young Yoon Lee, Joseph Liu, Yizhak Ben-Shabat, Victor Zordan, Mubbasir Kapadia

TL;DR

StyleMotif introduces a stylized motion diffusion framework that unifies content- and multi-modal style-conditioned generation within a single-branch architecture. By implementing a style-content cross fusion mechanism and aligning a style encoder with a pre-trained multi-modal model, it delivers faithful style transfer across motion, text, image, audio, and video cues while preserving motion realism. The approach leverages a latent diffusion backbone (MLD) and a pre-trained style encoder, with cross normalization to fuse style features into content features, improving style expressiveness and efficiency. Extensive experiments demonstrate improved metrics over prior methods (e.g., SRA, FID, and MM Dist) and reveal emergent multi-modal stylization capabilities with practical applicability to animation, gaming, and virtual reality. The work highlights the potential of single-branch, multi-modal conditioning for flexible, high-quality stylized motion synthesis and points to future data-collection and generalization opportunities.
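
The summary above names cross normalization as the step that fuses style features into content features, but does not give its formula. The following is a minimal sketch of one plausible form, assuming that style features are standardized with the per-channel mean and variance of the content features and then added to the content features with a scalar weight. The function name `cross_normalize`, the `scale` parameter, and the equal-length token layout are illustrative assumptions, not the authors' implementation.

```python
import math


def cross_normalize(content, style, scale=1.0, eps=1e-6):
    """Fuse style features into content features (hypothetical sketch).

    content, style: lists of feature vectors (lists of floats) with the
    same number of tokens and the same feature dimension.
    Assumption: style features are standardized with the *content*
    statistics, then added to the content features weighted by `scale`.
    """
    n = len(content)
    dim = len(content[0])

    # Per-channel mean and (population) variance of the content features.
    mean = [sum(row[d] for row in content) / n for d in range(dim)]
    var = [sum((row[d] - mean[d]) ** 2 for row in content) / n for d in range(dim)]

    fused = []
    for c_row, s_row in zip(content, style):
        fused.append([
            # Shift and rescale the style value into the content distribution,
            # then add it residually to the content value.
            c + scale * (s - mean[d]) / math.sqrt(var[d] + eps)
            for d, (c, s) in enumerate(zip(c_row, s_row))
        ])
    return fused


content = [[0.0, 2.0], [2.0, 4.0]]
style = [[2.0, 3.0], [1.0, 3.0]]
fused = cross_normalize(content, style)
```

Setting `scale=0.0` returns the content features unchanged, which makes the residual weight a simple knob for style strength in this sketch.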

Abstract

We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: https://stylemotif.github.io
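
The abstract states that the style encoder is aligned with a pre-trained multi-modal model, and the Figure 2 caption attributes this alignment to contrastive learning. The exact loss is not given in this section, so the sketch below assumes a symmetric InfoNCE-style objective over paired embeddings. The function names, the temperature value of 0.07, and the use of plain Python lists are all illustrative assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(style_emb, anchor_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss between paired embeddings (hypothetical).

    style_emb:  outputs of the style encoder being trained.
    anchor_emb: embeddings of the paired inputs from the frozen
                multi-modal model; row i pairs with row i.
    """
    s = [_normalize(v) for v in style_emb]
    a = [_normalize(v) for v in anchor_emb]
    # Cosine similarity of every style/anchor pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(si, aj)) / temperature for aj in a]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the style-to-anchor and anchor-to-style directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is near zero when each style embedding matches its paired anchor and large when the pairs are swapped. That is the behavior such an alignment objective relies on to place style embeddings in the shared multi-modal space.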

Paper Structure

This paper contains 36 sections, 7 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Comparison of Our Proposed StyleMotif Framework with SMooDi. Unlike SMooDi’s dual-branch design, which increases model complexity and training overhead, StyleMotif employs a streamlined single-branch structure, enabling efficient multi-modal motion stylization while preserving motion realism.
  • Figure 2: Overall Pipeline of StyleMotif, a single-branch diffusion framework for multi-modal motion stylization. Given a text prompt and a reference style from one of several modalities, our model extracts style features and fuses them with content features via style-content cross fusion. Through multi-modal alignment with contrastive learning, we enable seamless multi-modal conditioning and flexible stylization across motion, text, images, audio, and video.
  • Figure 3: Quantitative Results for Motion-Guided and Text-Guided Stylization. Bold values denote the best performance. Diversity has no ground-truth reference, so no value is bolded; the metric is reported for reference only.
  • Figure 4: Quantitative Results of Motion Style Transfer on the HumanML3D [Guo_2022_CVPR] dataset. Our method outperforms previous works on all metrics, demonstrating effective style-content fusion for high-quality motion style transfer and benefiting downstream tasks beyond motion stylization.
  • Figure 5: Qualitative Results of Motion-Guided Stylization. Our model generates cohesive and realistic motions that effectively align style and content, such as preserving the 'circular' trajectory (first column) and 'hop' content (third column). In contrast, SMooDi [zhong2025smoodi] struggles to maintain content fidelity and sometimes fails to reflect the specified style (e.g., 'phone on the left' in the second column).
  • ...and 5 more figures