ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation

Kerui Chen, Jianrong Zhang, Ming Li, Zhonglong Zheng, Hehe Fan

TL;DR

ClusterStyle tackles intra-style diversity in stylized motion generation by representing each style with multiple non-learnable prototypes, capturing sub-styles at global and local levels. It introduces prototype-based clustering (with Sinkhorn assignment and two contrastive losses), hierarchical style modeling, and a Stylistic Modulation Adapter to fuse style into a pretrained diffusion-based text-to-motion model. The approach achieves state-of-the-art results on stylized motion generation and motion style transfer across HumanML3D and 100STYLE, with extensive ablations confirming the value of global/local prototypes and the proposed losses. This framework yields diverse, content-faithful motions and provides interpretable sub-style patterns that enhance controllability and realism.

Abstract

Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.
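
As an illustration of what aligning embeddings with fixed prototype anchors might look like, the following is a minimal sketch of a balanced soft assignment computed with Sinkhorn-Knopp iterations. It is not the authors' implementation. The function name `sinkhorn_assign`, the temperature `epsilon`, the iteration count, and the random unit-norm prototypes are all assumptions made for this example; the paper's actual prototype construction and hyperparameters are not given in this summary.

```python
import numpy as np

def sinkhorn_assign(scores, epsilon=0.05, n_iters=50):
    """Softly assign N embeddings to K prototypes with balanced usage.

    scores: (N, K) similarity matrix between embeddings and prototypes.
    Returns an (N, K) matrix whose rows each sum to 1.
    epsilon and n_iters are illustrative values, not taken from the paper.
    """
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        # Column step: every prototype receives total mass 1/K.
        q /= q.sum(axis=0, keepdims=True)
        q /= k
        # Row step: every sample carries total mass 1/N.
        q /= q.sum(axis=1, keepdims=True)
        q /= n
    # Rescale so each row is a distribution over prototypes.
    return q * n

rng = np.random.default_rng(0)
K, D, N = 4, 16, 32  # hypothetical sizes: prototypes, feature dim, samples

# Fixed (non-learnable) prototypes: drawn once and never updated.
prototypes = rng.normal(size=(K, D))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

# Stand-in for style-encoder outputs of motions from one style category.
embeddings = rng.normal(size=(N, D))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

scores = embeddings @ prototypes.T       # cosine similarities, shape (N, K)
assignments = sinkhorn_assign(scores)    # soft sub-style assignments
hard_labels = assignments.argmax(axis=1) # most likely prototype per motion
```

The balancing steps discourage all motions of a style from collapsing onto a single prototype, which is one plausible way to keep several sub-styles in use. In a training loop, such assignments would typically serve as targets for a contrastive or cross-entropy objective on the encoder while the prototypes stay frozen.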

Paper Structure

This paper contains 24 sections, 19 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between SMooDi [zhong2024smoodi] and ClusterStyle. Given a content text and two style motions that share the same style but differ in expression, (a) SMooDi generates similar stylized motions. Its generated motions also retain elements of the style motions' content, making them inconsistent with the intended content text. (b) In contrast, by using different prototypes corresponding to the 'Aeroplane' style, our method generates diverse motions of varying extent and better preserves the semantics of the content text.
  • Figure 2: Overview of ClusterStyle. Our method consists of a style encoder and a motion latent diffusion model. In the style encoder, we present a cluster-based prototype learning paradigm that represents each style category with a set of non-learnable prototypes (cluster centers) to model intra-style diversity explicitly. Two contrastive losses are then proposed, one for prototype-based intra-style learning and one for inter-style learning. To incorporate the style embedding into the diffusion process, we introduce a Stylistic Modulation Adapter (SMA), enabling effective guidance of stylized motion generation.
  • Figure 3: Qualitative results of stylized motion generation. We compare our method with SMooDi under various text prompts and style motion inputs. Our approach demonstrates better content alignment and achieves more accurate and expressive style rendering. For example, the results of SMooDi are inconsistent with the motion trajectories (e.g., "backward", "straight") and actions (e.g., "walk") described in the content text. More visual comparisons can be found on the website.
  • Figure 4: Qualitative results of motion style transfer. Our approach effectively transfers the target motion style, such as 'Chicken' or 'Star', onto the original motion, preserving its structure while adapting its stylistic characteristics.
  • Figure 5: Visualization of prototype guiding. We visualize how global and local prototypes guide the stylization process for diverse generation results under the 'Aeroplane' style. (a) Global prototype; (b) Local prototype.