ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation
Kerui Chen, Jianrong Zhang, Ming Li, Zhonglong Zheng, Hehe Fan
TL;DR
ClusterStyle tackles intra-style diversity in stylized motion generation by representing each style with multiple non-learnable prototypes, capturing sub-styles at global and local levels. It introduces prototype-based clustering (with Sinkhorn assignment and two contrastive losses), hierarchical style modeling, and a Stylistic Modulation Adapter to fuse style into a pretrained diffusion-based text-to-motion model. The approach achieves state-of-the-art results on stylized motion generation and motion style transfer across HumanML3D and 100STYLE, with extensive ablations confirming the value of global/local prototypes and the proposed losses. This framework yields diverse, content-faithful motions and provides interpretable sub-style patterns that enhance controllability and realism.
Abstract
Existing stylized motion generation models have shown a remarkable ability to extract style information from a style motion and transfer it to a content motion. However, capturing intra-style diversity, where a single style corresponds to diverse motion variations, remains a significant challenge. In this paper, we propose ClusterStyle, a clustering-based framework that addresses this limitation. Instead of learning an unstructured embedding from each style motion, we use a set of prototypes to model the diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of a motion sequence. These two levels of diversity jointly shape two structured style embedding spaces, global and local, which are optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with a Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.
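
The TL;DR mentions Sinkhorn assignment of style features to non-learnable prototypes. As a rough illustration only, and not the authors' implementation, the NumPy sketch below shows a generic Sinkhorn-Knopp balanced soft assignment of embeddings to a fixed set of prototype anchors. The function names (`l2_normalize`, `sinkhorn_assign`), the parameters `epsilon` and `n_iters`, the array sizes, and the random initialization of the prototypes are all assumptions made for this example; the paper's actual formulation, losses, and hyperparameters may differ.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def sinkhorn_assign(scores, epsilon=0.05, n_iters=3):
    """Generic Sinkhorn-Knopp soft assignment (hypothetical helper).

    scores: (N, K) similarities between N embeddings and K prototypes.
    Returns an (N, K) matrix whose rows each sum to 1, while the alternating
    normalization pushes the prototypes toward receiving equal total mass.
    """
    # Subtract the global maximum before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        # Column step: every prototype receives total mass 1/k.
        q /= q.sum(axis=0, keepdims=True)
        q /= k
        # Row step: every embedding carries total mass 1/n.
        q /= q.sum(axis=1, keepdims=True)
        q /= n
    # Rescale so that each row is a distribution over prototypes.
    return q * n


rng = np.random.default_rng(0)
# Assumption: prototypes are fixed random unit vectors that are never trained.
prototypes = l2_normalize(rng.normal(size=(4, 16)))
# Stand-in for style embeddings of 32 motions from one style category.
embeddings = l2_normalize(rng.normal(size=(32, 16)))
assignment = sinkhorn_assign(embeddings @ prototypes.T)
```

Each row of `assignment` can be read as a soft membership of one motion over the sub-style prototypes. The balancing step discourages the degenerate solution in which every motion collapses onto a single prototype, which is one common motivation for using Sinkhorn in prototype-based clustering.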
