MoST: Motion Style Transformer between Diverse Action Contents

Boeun Kim, Jungho Kim, Hyung Jin Chang, Jin Young Choi

TL;DR

MoST introduces a transformer-based framework to transfer style between motion sequences with different contents by explicitly disentangling style and content. It uses Siamese encoders to extract $S^C,Y^C$ and $S^S,Y^S$, a Part-Attentive Style Modulator to align $S^S$ with $C^C$ via cross-attention, and a motion generator conditioned through AdaIN on $\tilde{S}^S$, producing $M^G$ from $M^C$ and $M^S$. A novel style disentanglement loss $L_D$ and a physics-based loss $L_{phy}$ improve robustness and physical plausibility, enabling high-quality results without post-processing on the Xia and BFA datasets. Compared to state-of-the-art methods, MoST excels in cross-content transfers, achieving lower CC and SC++ errors and producing coherent global translation and pose dynamics. The approach has practical impact for animation and game pipelines by delivering believable stylized motions across diverse actions.
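To make the AdaIN conditioning mentioned above concrete, here is a minimal sketch of adaptive instance normalization in plain Python. It is not taken from the MoST code: the function name `adain`, the per-channel `gamma` and `beta` values (assumed here to be predicted from the modulated style feature $\tilde{S}^S$), and the toy feature layout are all illustrative assumptions.

```python
import math

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over the time axis.

    content: list of channels, each a list of per-frame feature values.
    gamma, beta: one scale and one shift per channel. In this sketch they
    are plain numbers; we ASSUME a real model would predict them from the
    style feature.
    """
    out = []
    for channel, g, b in zip(content, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Remove the content feature's own statistics, then impose the
        # style-derived scale and shift.
        out.append([g * (x - mean) / std + b for x in channel])
    return out

# Toy content feature: 2 channels, 4 frames (hypothetical values).
content = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 11.5]]
gamma = [2.0, 0.5]   # hypothetical style-derived scales
beta = [1.0, -1.0]   # hypothetical style-derived shifts
stylized = adain(content, gamma, beta)
```

After the call, each channel of `stylized` has a mean close to its `beta` and a standard deviation close to its `gamma`, while the relative ordering of frames within the channel is unchanged. That is the sense in which the statistics carry style and the normalized shape carries content.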

Abstract

While existing motion style transfer methods are effective between two motions with identical content, their performance significantly diminishes when transferring style between motions with different contents. This challenge lies in the lack of clear separation between content and style of a motion. To tackle this challenge, we propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with transferred style from a source motion. Our distinctive approach to achieving the goal of disentanglement is twofold: (1) a new architecture for motion style transformer with 'part-attentive style modulator across body parts' and 'Siamese encoders that encode style and content features separately'; (2) style disentanglement loss. Our method outperforms existing methods and demonstrates exceptionally high quality, particularly in motion pairs with different contents, without the need for heuristic post-processing. Code is available at https://github.com/Boeun-Kim/MoST.
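As a rough illustration of how a part-attentive modulator could route style across body parts, the sketch below implements single-head scaled dot-product cross-attention in plain Python. It is a simplified assumption, not the authors' implementation: there are no learned projections, and the choice of queries (content features of the content motion), keys (content features of the style motion), and values (style features of the style motion) is our reading of the framework description. All names and numbers are hypothetical.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per body part of the content motion (P_q x d)
    keys:    one vector per body part of the style motion   (P_k x d)
    values:  one style vector per body part of the style motion (P_k x d_v)
    Returns the attended style vectors (P_q x d_v) and the weights (P_q x P_k).
    """
    d = len(queries[0])
    out, weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of style vectors for this content body part.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out, weights

# Toy example: 2 content-motion parts attend over 3 style-motion parts.
c_content = [[4.0, 0.0], [0.0, 4.0]]              # hypothetical queries
c_style = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]    # hypothetical keys
s_style = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical values
modulated, attn = cross_attention(c_content, c_style, s_style)
```

In this toy setup the first content part matches only the first style part, so it draws most of its style from that part. The second content part matches the other two style parts equally and receives an even blend of their style vectors.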

Paper Structure (25 sections, 23 equations, 22 figures, 4 tables)

This paper contains 25 sections, 23 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: Frequent failure cases in existing methods: (a) A result of MotionPuzzle [jang2022motion] replicating style motion. (b) A result of Aberman et al. [aberman2020unpaired] showing complete failure with twisted motion. The character for the visualization is sourced from Mixamo [mixamo].
  • Figure 2: (a) Overall framework of MoST comprising Siamese motion encoders $\mathcal{E}$, motion generator $\mathcal{G}$, and part-attentive style modulator (PSM). PSM modulates style feature $S^S$ under the condition of both contents of content motion and style motion, i.e., $C^C$ and $C^S$. $\mathcal{G}$ generates final output motion with content dynamics feature $Y^C$ and the modulated style feature $\tilde{S}^S$. (b) Detailed operations in PSM
  • Figure 3: Description of evaluation metrics, using easy-to-recognize label notations. Note that our model uses only motion data
  • Figure 4: Qualitative results on the Xia [xia2015realtime] and BFA [aberman2020unpaired] datasets. Please refer to the red indications. (1) Our method better reflects the style of old in comparison to other existing methods, accurately representing both the bent upper body and leg. (2) Other methods fail to preserve the content of punch; instead, they result in peculiar leg movements or body twists. On the other hand, our result accurately depicts strutting punch, where the upper body leans backward. (3) The results of [aberman2020unpaired] and [jang2022motion] do not exhibit a kick; instead, the arms move. [park2021diverse] yields twisted leg movements. (4) Unlike our method, others fail to preserve the content of punch, resulting in vibrations in static poses or twists
  • Figure 5: (a-b) Visualization of the modulated style feature ($\tilde{S}^S$) space of MoST in different loss settings. $L_{pre}$ and $L_{phy}$ are applied by default in (a). $L_{D}$ is additionally introduced in (b). All training and testing data are used as style motion, and a single data point in the test set is employed for content motion. The spaces are projected in 2D through t-SNE. The samples are visualized with different shapes according to their content labels and different colors according to their style labels. (c) Space of $S^S$ before PSM. All loss functions are applied
  • ...and 17 more figures