Generative Human Motion Stylization in Latent Space

Chuan Guo; Yuxuan Mu; Xinxin Zuo; Peng Dai; Youliang Yan; Juwei Lu; Li Cheng

TL;DR

This work introduces a generative motion stylization framework that operates in the latent space of a pretrained autoencoder. It decomposes each motion code into a deterministic content code and a probabilistic style code that follows a prior distribution; a generator recombines the two, so a single content motion can yield diverse stylizations conditioned on a style motion, a style label, or a style sampled from the prior. Key contributions include a probabilistic style space learned from latent codes, a homo-style alignment loss that keeps the style spaces of segments cut from the same sequence close to each other, and a lightweight global motion predictor that preserves realistic pacing. Experiments on three motion datasets show improved style reenactment, content preservation, and generalization, along with efficiency gains. The flexible conditioning (style motion, style label, or prior sampling) and support for zero-shot stylization make the approach practical for animation pipelines and text-to-motion integration.

Abstract

Human motion stylization aims to revise the style of an input motion while keeping its content unaltered. Unlike existing works that operate directly in pose space, we leverage the latent space of pretrained autoencoders as a more expressive and robust representation for motion extraction and infusion. Building upon this, we present a novel generative model that produces diverse stylization results of a single motion (latent) code. During training, a motion code is decomposed into two coding components: a deterministic content code, and a probabilistic style code adhering to a prior distribution; then a generator massages the random combination of content and style codes to reconstruct the corresponding motion codes. Our approach is versatile, allowing the learning of probabilistic style space from either style labeled or unlabeled motions, providing notable flexibility in stylization as well. In inference, users can opt to stylize a motion using style cues from a reference motion or a label. Even in the absence of explicit style input, our model facilitates novel re-stylization by sampling from the unconditional style prior distribution. Experimental results show that our proposed stylization models, despite their lightweight design, outperform the state-of-the-art in style reenactment, content preservation, and generalization across various applications and settings. Project Page: https://murrol.github.io/GenMoStyle
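The probabilistic style code described in the abstract can be illustrated with a minimal sketch. It assumes the style space is a diagonal Gaussian parameterized by a mean and log-variance, and that the unconditional prior is a standard normal; neither detail is stated in the text above. The function names `sample_style_code` and `sample_prior_style` are hypothetical and are not taken from the authors' code.

```python
import math
import random


def sample_style_code(mu, log_var, rng):
    """Reparameterized draw z_s = mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: the style space is a diagonal Gaussian given by
    (mu, log_var), as a style encoder might output it.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sample_prior_style(dim, rng):
    """Draw z_s from an assumed standard normal prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)

# Hypothetical style-space parameters for one motion code.
mu = [0.5, -1.0, 2.0]
log_var = [-2.0, -2.0, -2.0]

# Several draws from one style space give several style codes, which is
# what would let one content code be stylized in diverse ways.
codes = [sample_style_code(mu, log_var, rng) for _ in range(3)]

# With no style input at all, a style code is drawn from the prior.
prior_code = sample_prior_style(len(mu), rng)
```

Each sampled code would then be paired with the same deterministic content code and passed to the generator; that step is omitted here because the generator architecture is not specified at this level of detail.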

Paper Structure (47 sections, 4 equations, 13 figures, 12 tables)

Introduction
Related Work
Image Style Transfer.
Motion Style Transfer.
Synthesis in Latent.
Generative Motion Stylization
Motion Latent Representation
Motion Latent Stylization Framework
Model Architecture.
Learning Scheme
AutoEncoding $\mathcal{L}_{rec}$.
Homo-style Alignment $\mathcal{L}_{hsa}$.
Swap and Cycle Reconstruction $\mathcal{L}_{cyc}$.
Unsupervised Scheme (w/o Style Label).
Difference of $\mathcal{N}_s$ Learned w and w/o Style Label.
...and 32 more sections

Figures (13)

Figure 1: (Top) Given an input motion and a target style label (i.e., old), our label-based stylization generates diverse results that follow the provided label. (Bottom) Without any style indicators, our prior-based method randomly re-stylizes the input motion using sampled prior styles $\mathbf{z}_s$. Five distinct stylized motions from the same content are presented, with poses synchronized and history in gray. See Figure 3 (b) and (d) for implementations.
Figure 2: Approach overview. (a) A pre-trained autoencoder $\mathcal{E}$ and $\mathcal{D}$ (see Motion Latent Representation) builds the mappings between motion and latent spaces. Motion (latent) code $\mathbf{z}$ is further encoded into two parts: content code $\mathbf{z}_c$ from the content encoder ($\mathrm{E}_c$), and style space $\mathcal{N}_s$ from the style encoder ($\mathrm{E}_s$), which takes style label $sl$ as an additional input. The content code ($\mathbf{z}_c$) is decoded back to motion code ($\mathbf{\hat{z}}$) via generator $\mathrm{G}$. Meanwhile, a style code $\mathbf{z}_s$ is sampled from the style space ($\mathcal{N}_s$) and, together with the style label ($sl$), is injected into the generator layers through adaptive instance normalization (AdaIN). (b) Learning scheme, where the style label ($sl$) is omitted for simplicity. Our model is trained by autoencoding when content and style come from the same input. When decoding with content from a different input (i.e., swap), we enforce the resulting motion code ($\mathbf{\hat{z}}^t$) to follow the cycle reconstruction constraint. For motion codes ($\mathbf{z}^1$, $\mathbf{z}^2$) segmented from the same sequence (homo-style), their style spaces are assumed to be close and are learned with the style alignment loss $\mathcal{L}_{hsa}$.
Figure 3: During inference, our approach can stylize input content motions with the style cues from (a, c) motion, (b) style label and (d) unconditional style prior space.
Figure 4: Qualitative comparisons of motion-based stylization. Given the style motion (green) and content motion (blue), we apply stylization using our methods (orange), [park2021diverse] (supervised), and [jang2022motion] (unsupervised). The content motions in the top two cases come from the [aberman2020unpaired] test set, and those in the bottom two from the CMU Mocap [cmu2021mocap] test set. Example artifacts are highlighted using red signs. More results are provided in supplementary videos.
Figure 5: Two examples of diverse label-based stylization (middle) and prior-based stylization (right).
...and 8 more figures
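The Figure 2 caption says the style code is injected into the generator layers through adaptive instance normalization (AdaIN). The sketch below shows the generic AdaIN operation on a channels-by-time feature map: each channel is normalized over time, then rescaled and shifted by style-dependent parameters. How the paper maps the style code and label to `gamma` and `beta` is not described here, so those values are supplied directly as hypothetical inputs, and the function name `adain` is illustrative only.

```python
import math


def adain(features, gamma, beta, eps=1e-5):
    """Generic AdaIN over a feature map stored as [channel][time].

    Each channel is normalized to zero mean and unit variance across
    time, then scaled by gamma[c] and shifted by beta[c].
    Assumption: gamma and beta come from some learned mapping of the
    style code; that mapping is not modeled in this sketch.
    """
    out = []
    for channel, g, b in zip(features, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([g * (x - mean) / std + b for x in channel])
    return out


# Hypothetical two-channel feature map with four time steps.
features = [[1.0, 2.0, 3.0, 4.0],
            [10.0, 10.0, 12.0, 12.0]]
gamma = [2.0, 0.5]   # per-channel scale (assumed to be style-derived)
beta = [1.0, -1.0]   # per-channel shift (assumed to be style-derived)

stylized = adain(features, gamma, beta)
```

After the call, each output channel has a mean close to its `beta` and a standard deviation close to its `gamma`, so the per-channel statistics are set by the style while the normalized temporal pattern comes from the content features.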
