Generative Motion Stylization of Cross-structure Characters within Canonical Motion Space

Jiaxu Zhang, Xin Chen, Gang Yu, Zhigang Tu

TL;DR

MotionS tackles stylized motion generation for characters with diverse skeletal topologies by learning a canonical motion space that can be stylized with cross-modality prompts. It introduces cross-modality style embedding to align style descriptions with a CLIP latent space and uses topology-encoded tokens to enable cross-structure topology shifting. A topology-shifted stylization diffusion model synthesizes content-consistent, stylized motion from a single sequence, achieving both diversity and style fidelity. Experiments show strong qualitative and quantitative gains over baselines and demonstrate zero-shot generalization to unseen prompts, highlighting its potential for flexible, large-scale animation workflows.

Abstract

Stylized motion breathes life into characters. However, the fixed skeleton structure and style representation hinder existing data-driven motion synthesis methods from generating stylized motion for various characters. In this work, we propose a generative motion stylization pipeline, named MotionS, for synthesizing diverse and stylized motion on cross-structure characters using cross-modality style prompts. Our key insight is to embed motion style into a cross-modality latent space and perceive the cross-structure skeleton topologies, allowing for motion stylization within a canonical motion space. Specifically, the large-scale Contrastive Language-Image Pre-training (CLIP) model is leveraged to construct the cross-modality latent space, enabling flexible style representation within it. Additionally, two topology-encoded tokens are learned to capture the canonical and specific skeleton topologies, facilitating cross-structure topology shifting. Subsequently, the topology-shifted stylization diffusion is designed to generate motion content for the particular skeleton and stylize it in the shifted canonical motion space using multi-modality style descriptions. Through an extensive set of examples, we demonstrate the flexibility and generalizability of our pipeline across various characters and style descriptions. Qualitative and quantitative comparisons show the superiority of our pipeline over state-of-the-art methods, consistently delivering high-quality stylized motion across a broad spectrum of skeletal structures.
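
To make the idea of embedding motion style into the CLIP latent space more concrete, here is a minimal sketch of an alignment objective between paired motion and style embeddings. It is an illustration under stated assumptions, not the training loss used by MotionS: the abstract does not specify the loss, so a plain mean cosine distance is used as a stand-in. The function names `cosine_similarity` and `alignment_loss` are hypothetical, and the short toy vectors stand in for real CLIP encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(motion_embs, clip_embs):
    """Mean (1 - cosine similarity) over paired embeddings.

    Assumption: motion_embs[i] and clip_embs[i] describe the same style.
    The loss is 0 when every pair points in the same direction.
    """
    pairs = list(zip(motion_embs, clip_embs))
    return sum(1.0 - cosine_similarity(m, c) for m, c in pairs) / len(pairs)


# Toy stand-ins for CLIP embeddings of two style prompts (hypothetical values).
clip_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Motion embeddings that point the same way as their paired style embedding.
aligned = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

# Motion embeddings orthogonal to their paired style embedding.
misaligned = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

print(alignment_loss(aligned, clip_embs))     # 0.0
print(alignment_loss(misaligned, clip_embs))  # 1.0
```

Cosine distance ignores vector length, so only the direction of a motion embedding has to match its paired style embedding. A contrastive objective over a batch would be another reasonable choice here.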

Paper Structure (11 sections, 11 equations, 11 figures)

Figures (11)

  • Figure 1: An overview of the Motion$\mathbb{S}$ pipeline. Motion$\mathbb{S}$ takes multi-modality prompts $p$ as style descriptions, generates diverse motion ${\hat{\bm{x}}}_{0}^{p}$ for specific skeletal structures through the diffusion denoising process, and performs the motion stylization in a canonical motion space $\Omega$.
  • Figure 2: The structure and training strategy of the cross-modality style embedding. We design a motion encoder to embed the motion $\bm{m}$ of the standard SMPL skeleton and align its latent space with the CLIP encoders. The motion encoder and the CLIP encoders constitute the cross-modality style encoder in Motion$\mathbb{S}$. Furthermore, the canonical motion space is constructed through the canonical decoder and the learnable canonical TET.
  • Figure 3: Illustration of the model details. The left part is the structure and training strategy of our topology-shifted stylization diffusion (TSD) model, which generates the stylized motion across specific and canonical motion spaces. The right part is the illustration of the cross-structure topology shifting.
  • Figure 4: Quantitative comparison on the SMPL source motion and the Xia et al. (2015) source motion. The best results are in bold, and the second best are underlined. Note that a balance of good scores across all metrics is better than excelling in just a few.
  • Figure 5: Qualitative comparison with SinMDM. SinMDM$^{*}$ refers to the model trained on the source motion instead of the style motion. SinMDM exhibits unstable performance in expressing both the motion content and style.
  • ...and 6 more figures