MulSMo: Multimodal Stylized Motion Generation by Bidirectional Control Flow

Zhe Li, Yisheng He, Lei Zhong, Weichao Shen, Qi Zuo, Lingteng Qiu, Zilong Dong, Laurence Tianruo Yang, Weihao Yuan

TL;DR

MulSMo addresses stylized motion generation by introducing a bidirectional control flow between the style and content networks and by enabling multimodal style signals through contrastive learning. It augments diffusion-based generation with a Motion-aligned Temporal VAE (MaTLD) to better preserve temporal dynamics in the motion latent space. The authors report that it outperforms prior stylized motion methods across multiple datasets while enabling style control from motions, text, or images. The framework targets multimodal, content-aware motion stylization, with potential applications in animation, AR/VR, and robotics.

Abstract

Generating motion sequences that conform to a target style while adhering to the given content prompts requires accommodating both content and style. In existing methods, information usually flows only from style to content, which may cause conflicts between the two and harm their integration. In contrast, we build a bidirectional control flow between style and content that also adjusts the style towards the content, which alleviates style-content collisions and better preserves the dynamics of the style in the integration. Moreover, we extend stylized motion generation from one modality, i.e., the style motion, to multiple modalities, including texts and images, through contrastive learning, enabling flexible style control over motion generation. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets while also enabling control by multimodal signals. The code of our method will be made publicly available.
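
The abstract says text and image style signals are brought in through contrastive learning but does not give the loss here. As a rough illustration only, the sketch below uses a symmetric InfoNCE objective, a common choice for aligning paired embeddings from two modalities. The function names, the temperature value, and the toy embeddings are assumptions for illustration, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(style_motion_emb, other_modal_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    style_motion_emb[i] and other_modal_emb[i] are assumed to describe the
    same style (hypothetical pairing; the temperature is also an assumption).
    """
    a = [l2_normalize(v) for v in style_motion_emb]
    b = [l2_normalize(v) for v in other_modal_emb]
    # Cosine-similarity logits between every motion and every text/image.
    logits = [
        [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        for a_i in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the motion-to-other and other-to-motion directions.
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed)
    )


# Toy batch: two style motions and their (hypothetical) paired text embeddings.
motion = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
loss_matched = symmetric_info_nce(motion, text)
loss_mismatched = symmetric_info_nce(motion, list(reversed(text)))
```

With correctly paired embeddings the loss is close to zero, and swapping the pairs makes it much larger, which is the signal that pulls matching style motions and texts or images together in the shared space.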

Paper Structure

This paper contains 31 sections, 9 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: MulSMo enables multimodal signals to control the stylized motion generation.
  • Figure 2: Overview of MulSMo. Our approach accepts various modalities as style signals, such as style motion sequences, texts, and images. We generate stylized human motions by combining content text $\mathbf{c}$ with style signals $\mathbf{s}$. In the encoder part of the latent denoiser, we establish a bidirectional control flow between the generation network and the style network, using a separate zero linear layer in each direction for fusion. (For simplicity, the style-to-content and content-to-style zero linear layers are drawn as one block in the figure.) The denoising step is repeated $T$ times to obtain the noise-free motion latent ${\bm{z}}_0$, which the latent decoder decodes into the stylized motion.
  • Figure 3: Motion-aligned Temporal VAE.
  • Figure 4: Architectures with different control flow between generation network and style network. Based on experimental validation, we select (b) as the final architecture.
  • Figure 5: Contrastive learning for enabling multi-modality stylized motion control. The adaptor can partially alleviate the gap between style motions and images/texts.
  • ...and 7 more figures
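
The bidirectional control flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, assuming that "zero linear" means a linear layer whose weights and bias are initialized to zero, so that at the start of training neither branch perturbs the other. The per-layer exchange order, the toy blocks, and all function names are hypothetical and are not taken from the paper's architecture.

```python
def linear(weight, bias, x):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def zero_linear(dim):
    """Return (weight, bias) of a linear layer initialized to all zeros."""
    return [[0.0] * dim for _ in range(dim)], [0.0] * dim


def add(u, v):
    """Element-wise sum of two feature vectors."""
    return [a + b for a, b in zip(u, v)]


def bidirectional_encoder(content, style, content_blocks, style_blocks, s2c, c2s):
    """Run paired encoder blocks and exchange features in both directions.

    s2c[i] injects style features into the content (generation) branch;
    c2s[i] injects content features into the style branch.
    """
    for c_block, s_block, (w_sc, b_sc), (w_cs, b_cs) in zip(
        content_blocks, style_blocks, s2c, c2s
    ):
        content_out = c_block(content)
        style_out = s_block(style)
        # Style-to-content flow: the style steers the generation branch.
        content = add(content_out, linear(w_sc, b_sc, style_out))
        # Content-to-style flow: the style is adjusted towards the content.
        style = add(style_out, linear(w_cs, b_cs, content_out))
    return content, style


# Toy stand-ins for the real encoder blocks (hypothetical).
dim, depth = 2, 2
content_blocks = [lambda x: [2.0 * v for v in x]] * depth
style_blocks = [lambda x: [v + 1.0 for v in x]] * depth
s2c = [zero_linear(dim) for _ in range(depth)]
c2s = [zero_linear(dim) for _ in range(depth)]

# With zero-initialized fusion layers, the two branches run independently.
content_init, style_init = bidirectional_encoder(
    [1.0, 2.0], [0.0, 1.0], content_blocks, style_blocks, s2c, c2s
)

# Once a fusion weight moves away from zero, style starts to affect content.
s2c[0] = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
content_fused, style_fused = bidirectional_encoder(
    [1.0, 2.0], [0.0, 1.0], content_blocks, style_blocks, s2c, c2s
)
```

Starting both fusion paths at zero means the combined model initially reproduces the unfused branches, and the exchange in either direction only takes effect as the fusion weights are learned.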