
PersonaBooth: Personalized Text-to-Motion Generation

Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, Hyung Jin Chang

TL;DR

This work introduces Motion Personalization, a task to generate text-driven motions that faithfully reflect an individual’s persona from a few atomic motions. It presents PerMo, a large-scale persona-labeled motion dataset, and PersonaBooth, a multi-modal finetuning framework that integrates persona cues via a Persona Extractor, a Personalized Text Encoder, and a Context-Aware Fusion module, trained with a diffusion objective and a persona cohesion loss. The method achieves state-of-the-art results on PerMo and 100Style, demonstrating improved FID, text-motion alignment, diversity, and persona-consistency, while enabling robust multi-input fusion. The work advances realistic avatar motion in virtual environments and provides a benchmark for evaluating motion personalization with multi-modal adaptation and contrastive persona learning.

Abstract

This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions, using several basic motions that carry a persona. To support this task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose PersonaBooth, a multi-modal finetuning method for a pretrained motion diffusion model. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique to enhance intra-cohesion among samples with the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.
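
To make the idea of contrastive intra-cohesion concrete, the following is a minimal sketch of a generic supervised-contrastive loss that pulls together features sharing a persona label. It is not the paper's exact Persona Cohesion Loss, whose formulation is not given here; the function name `persona_cohesion_loss`, the `temperature` value, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np

def persona_cohesion_loss(features, persona_ids, temperature=0.1):
    """Generic supervised-contrastive loss (illustrative, not the paper's exact loss).

    features:    (N, D) array of persona features, one per input motion.
    persona_ids: (N,) array of persona labels; equal labels are positives.
    """
    features = np.asarray(features, dtype=np.float64)
    persona_ids = np.asarray(persona_ids)
    n = features.shape[0]

    # L2-normalize so the dot product is a cosine similarity.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Exclude each sample's similarity with itself.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable row-wise log-softmax over the other samples.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_denom

    # Positives: other samples with the same persona label.
    pos_mask = (persona_ids[:, None] == persona_ids[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    if not valid.any():
        return 0.0  # no anchor has a positive pair in this batch

    # Average log-probability of the positives, per anchor, then negate.
    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float((-sum_pos[valid] / pos_counts[valid]).mean())


# Toy usage: two tight clusters of 2-D features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = persona_cohesion_loss(feats, [0, 0, 1, 1])   # labels follow clusters
loss_mismatched = persona_cohesion_loss(feats, [0, 1, 0, 1])  # labels cut across clusters
```

In this toy example the loss is small when the labels follow the feature clusters and large when they cut across them, which is the behavior a cohesion term is meant to reward.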

Paper Structure

This paper contains 24 sections, 9 equations, 19 figures, 10 tables.

Figures (19)

  • Figure 1: Motion Personalization generates text-driven, personalized motions based on persona embedded in atomic input motions. We propose a new framework, PersonaBooth, along with a new benchmark dataset, PerMo, for Motion Personalization.
  • Figure 2: The overall framework of PersonaBooth. PersonaBooth has two adaptation paths—visual and text—for finetuning the Motion Diffusion model ($\mathcal{D}$). The Persona Extractor extracts both a visual persona feature ($V^*$) and a persona token ($P^*$) from the input motions. $V^*$ is input into the adaptive layer of $\mathcal{D}$, while $P^*$ is processed together with the input prompt through a Personalized Text Encoder, generating a personalized text feature, which is then input to $\mathcal{D}$. The entire model is trained with a classifier-free approach, incorporating a Persona Cohesion Loss. During inference, Context-Aware Fusion is applied for multiple input cases.
  • Figure 3: Textual and visual adaptation. (a) Personalized Text Encoder, $\mathcal{X}$. (b) $t$-th step of the Motion Diffusion, $\mathcal{D}$. Learnable parameters are denoted by the fire icon.
  • Figure 4: (a) Motion capture studio and examples of data formats: skeleton, markers, and mesh. (b) Unique persona expressions of each actor in the 'Childish' category. (c) Rendered mesh for each actor in the 'Fearful' category.
  • Figure 5: Example of the ablation study. The input motions are from the 'Uppity' category of Actor 1. The input prompt is "A person walks in a circle." In (a) and (b), only $M_1$ is provided as input, while both $M_1$ and $M_2$ are provided for (c) and (d). $L_{pc}$ encourages the generated motion to closely follow the prompt, while Context-Aware Fusion (CAF) prevents the motion from blending. We set $k=1$ for CAF.
  • ...and 14 more figures
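
The Figure 2 caption states that the model is trained with a classifier-free approach. As a rough illustration of what that usually involves, the sketch below shows the two standard ingredients of classifier-free guidance: randomly dropping the condition during training, and blending conditional and unconditional predictions at sampling time. The function names, the drop probability, and the guidance scale are assumptions for illustration; the caption does not specify how PersonaBooth sets them or which conditions are dropped.

```python
import numpy as np

def drop_condition(cond, null_cond, drop_prob, rng):
    """Training-time step: with probability drop_prob, replace the condition
    with a null embedding so the model also learns an unconditional prediction."""
    return null_cond if rng.random() < drop_prob else cond

def classifier_free_guidance(pred_cond, pred_uncond, guidance_scale):
    """Sampling-time step: extrapolate from the unconditional prediction
    toward the conditional one by guidance_scale."""
    pred_cond = np.asarray(pred_cond, dtype=np.float64)
    pred_uncond = np.asarray(pred_uncond, dtype=np.float64)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)


# Toy usage with hypothetical 3-dimensional predictions.
rng = np.random.default_rng(0)
cond = np.array([1.0, 2.0, 3.0])       # stand-in for a text/persona condition
null = np.zeros(3)                     # stand-in for the null condition
always_dropped = drop_condition(cond, null, drop_prob=1.0, rng=rng)
never_dropped = drop_condition(cond, null, drop_prob=0.0, rng=rng)

pred_c = np.array([0.5, 0.5, 0.5])     # prediction given the condition
pred_u = np.array([0.1, 0.2, 0.3])     # prediction given the null condition
guided = classifier_free_guidance(pred_c, pred_u, guidance_scale=2.0)
```

A guidance scale of 1 recovers the conditional prediction and a scale of 0 recovers the unconditional one; values above 1 push the sample further toward the condition.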