SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing

Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, Junyong Noh

TL;DR

SALAD addresses the limitations of crude pose and word representations in text-to-motion by introducing a skeleton-aware latent diffusion framework. It employs a skeleto-temporal VAE to map motions into a compact latent space $\mathbf{z} \in \mathbb{R}^{N' \times J' \times D}$ with $J'=7$. A diffusion denoiser then uses TempAttn, SkelAttn, and CrossAttn to capture frame-level, joint-level, and text interactions, respectively. The cross-attention maps enable zero-shot editing via velocity $v_t$ modulation and classifier-free guidance. The approach achieves superior text-motion alignment on two benchmarks, HumanML3D and KIT-ML (e.g., SALAD yields the best or near-best R-precision and strong FID), while supporting editing without extra optimization through four attention-modulation operators. The work provides interpretable representations for motion synthesis and practical editing capabilities with a pre-trained model, potentially impacting animation pipelines and downstream understanding of text-to-motion mappings. Overall, SALAD advances the state of text-driven motion generation and zero-shot editing by explicitly modeling skeleton-temporal-text relationships and enabling training-free motion edits through attention manipulation.
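The three attention types named above differ mainly in which axis of the latent $\mathbf{z}$ they attend over. The following is a minimal NumPy sketch of that idea, not the authors' implementation: it uses single-head attention with identity projections (no learned weights), and the function names, the frame count of 8, the feature width of 16, and the 5-token text embedding are all illustrative assumptions. Only $J'=7$ comes from the summary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention over the second-to-last axis.
    # x: (..., L, D). Identity Q/K/V projections are an assumption for brevity.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., L, L)
    return softmax(scores, axis=-1) @ x                 # (..., L, D)

def temporal_attention(z):
    # z: (N', J', D). Joints act as the batch axis; attention runs across frames.
    zt = np.swapaxes(z, 0, 1)                           # (J', N', D)
    return np.swapaxes(self_attention(zt), 0, 1)        # (N', J', D)

def skeletal_attention(z):
    # z: (N', J', D). Frames act as the batch axis; attention runs across joints.
    return self_attention(z)

def cross_attention(z, text):
    # z: (N', J', D) motion latents as queries; text: (W, D) word features as keys/values.
    n, j, d = z.shape
    q = z.reshape(n * j, d)
    attn = softmax(q @ text.T / np.sqrt(d), axis=-1)    # (N'*J', W)
    out = attn @ text                                   # (N'*J', D)
    return out.reshape(n, j, d), attn.reshape(n, j, -1)

# Hypothetical sizes: 8 latent frames, J' = 7 joints, D = 16, 5 text tokens.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 7, 16))
text = rng.normal(size=(5, 16))
out, attn_map = cross_attention(z, text)
```

In this sketch `attn_map` has one weight per (frame, joint, word) triple, which is the kind of map the summary says is modulated for editing.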

Abstract

Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available at the project page.

Paper Structure

This paper contains 28 sections, 16 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Architecture of the skeleto-temporal VAE network. The encoder maps motion features into a skeleto-temporal latent space, and the decoder restores the skeleto-temporal latent variables into motion features.
  • Figure 2: (Left) Overall network architecture of the denoiser. (Right) The architecture of each transformer block consisting of TempAttn, SkelAttn, CrossAttn, and FFN, along with the FiLM following each module.
  • Figure 2: Illustration of the skeletal pooling process for the HumanML3D and KIT-ML datasets. The original skeleton (left) is progressively abstracted by pooling adjacent joints (middle and right). The notation $i \leftarrow \{ \}$ indicates the abstracted joint index and the set of original joints that are grouped together. The unpooling layers operate in the reverse order to restore the skeletal resolution.
  • Figure 3: Attention modulation methods applied to the cross-attention maps to enable text-driven motion editing.
  • Figure 3: Additional visualizations of cross-attention maps between text and motion. Each row corresponds to a specific body part, and each column represents temporal frames.
  • ...and 4 more figures
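
The skeletal pooling caption above describes grouping adjacent joints into abstracted joints, written $i \leftarrow \{\,\}$, with unpooling reversing the process. A rough sketch of that mechanism follows. The grouping table, the use of mean pooling, and copy-based unpooling are assumptions made for illustration; the actual joint groups for HumanML3D and KIT-ML are given in the paper's figure and are not reproduced here.

```python
import numpy as np

# Hypothetical grouping (abstracted joint index -> original joint indices),
# mirroring the caption's i <- {...} notation on a toy 6-joint skeleton.
GROUPS = {0: [0, 1], 1: [2, 3], 2: [4, 5]}

def skel_pool(x, groups):
    # x: (N, J, D) -> (N, len(groups), D); averages features of grouped joints.
    # Mean pooling is an assumption; the source only says joints are "pooled".
    return np.stack(
        [x[:, groups[i], :].mean(axis=1) for i in sorted(groups)], axis=1
    )

def skel_unpool(y, groups, num_joints):
    # y: (N, len(groups), D) -> (N, num_joints, D); copies each abstracted
    # joint's features back to every original joint in its group.
    n, _, d = y.shape
    x = np.zeros((n, num_joints, d), dtype=y.dtype)
    for i, members in groups.items():
        x[:, members, :] = y[:, i:i + 1, :]
    return x

frames = np.arange(2 * 6 * 4, dtype=float).reshape(2, 6, 4)  # toy (N, J, D) features
pooled = skel_pool(frames, GROUPS)
restored = skel_unpool(pooled, GROUPS, num_joints=6)
```

With these choices, pooling an unpooled tensor returns the pooled tensor unchanged, which is a convenient sanity check when chaining several pooling stages.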