SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing
Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, Junyong Noh
TL;DR
SALAD addresses the crude pose and word representations common in text-to-motion models by introducing a skeleton-aware latent diffusion framework. It employs a skeleto-temporal VAE to map motions into a compact latent space $\mathbf{z} \in \mathbb{R}^{N' \times J' \times D}$ with $J'=7$, followed by a diffusion denoiser that uses TempAttn, SkelAttn, and CrossAttn to capture frame-level, joint-level, and text interactions, respectively; cross-attention enables zero-shot editing via velocity $v_t$ modulation and classifier-free guidance. The approach achieves superior text-motion alignment on two benchmarks (best or near-best R-precision with strong FID) and supports editing without extra optimization through four attention-modulation operators. Because editing works with a pre-trained model and the learned representations are interpretable, the method could benefit animation pipelines and the downstream understanding of text-to-motion mappings.
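The three attention types named above differ mainly in which axis of the latent $\mathbf{z} \in \mathbb{R}^{N' \times J' \times D}$ they attend over. The following NumPy sketch illustrates that idea only and is not the authors' implementation: it assumes a single head with no learned projections, and the function names (`attend`, `temp_attn`, `skel_attn`, `cross_attn`, `cfg`), the sizes $N'=8$, $D=16$, and the five-word prompt are all hypothetical. The `cfg` helper uses the standard classifier-free guidance combination, which the summary does not spell out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention; leading axes are treated as batch."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def temp_attn(z):
    # z: (N', J', D). Joints act as the batch axis; attention runs across frames.
    zt = np.swapaxes(z, 0, 1)              # (J', N', D)
    out, w = attend(zt, zt, zt)            # w: (J', N', N')
    return np.swapaxes(out, 0, 1), w

def skel_attn(z):
    # Frames act as the batch axis; attention runs across joints.
    return attend(z, z, z)                 # w: (N', J', J')

def cross_attn(z, text):
    # text: (L, D). Every latent token queries the L word embeddings.
    n, j, d = z.shape
    out, w = attend(z.reshape(n * j, d), text, text)
    return out.reshape(n, j, d), w.reshape(n, j, -1)   # w: (N', J', L)

def cfg(cond, uncond, scale):
    # Standard classifier-free guidance (assumed form, not taken from the paper).
    return uncond + scale * (cond - uncond)

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 7, 16))        # N'=8 frames, J'=7 joints, D=16
text = rng.standard_normal((5, 16))        # 5 hypothetical word embeddings

z1, w_t = temp_attn(z)
z2, w_s = skel_attn(z1)
z3, w_c = cross_attn(z2, text)
print(w_t.shape, w_s.shape, w_c.shape)
```

In this sketch, `w_c` holds one weight per (frame, joint, word) triple, which is the kind of cross-attention map that an attention-based edit would modify before recomputing the output.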
Abstract
Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when pre-trained models are used for downstream tasks such as editing, they typically require additional effort, including manual intervention, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available on the project page.
