MulSMo: Multimodal Stylized Motion Generation by Bidirectional Control Flow
Zhe Li, Yisheng He, Lei Zhong, Weichao Shen, Qi Zuo, Lingteng Qiu, Zilong Dong, Laurence Tianruo Yang, Weihao Yuan
TL;DR
MulSMo addresses stylized motion generation by introducing a bidirectional control flow between the style and content networks and by accepting multimodal style signals through contrastive learning. It augments diffusion-based generation with a Motion-aligned Temporal VAE (MaTLD) to better preserve temporal dynamics in the motion latent space. The approach outperforms prior stylized motion methods across multiple datasets and supports style control from motions, text, or images. The result is a flexible framework for multimodal, content-aware motion stylization, with potential applications in animation, AR/VR, and robotics.
Abstract
Generating motion sequences that conform to a target style while adhering to given content prompts requires accommodating both content and style. In existing methods, information usually flows only from style to content, which can cause conflicts between the two and harm their integration. In contrast, we build a bidirectional control flow between style and content that also adjusts the style towards the content, which alleviates style-content collisions and better preserves the dynamics of the style during integration. Moreover, we extend stylized motion generation from a single modality, i.e., the style motion, to multiple modalities, including text and images, through contrastive learning, enabling flexible style control over motion generation. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets while also enabling control by multimodal signals. The code of our method will be made publicly available.
