LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation
Heechang Kim, Gwanghyun Kim, Se Young Chun
TL;DR
LaMoGen addresses the challenge of fine-grained expressive control in text-to-motion generation by integrating Laban Movement Analysis (LMA) with diffusion-based synthesis. It introduces a zero-shot, inference-time guidance mechanism that differentiably models LMA features and updates the text embedding during DDIM sampling to align generated motions with target Laban Effort and Shape tags, while preserving the motion's identity. The method comprises differentiable Laban feature extraction, a two-step generation pipeline (baseline then Laban-guided refinement), and a relative Laban loss that steers conditioning during sampling. Quantitative and qualitative results show improved controllability and disentanglement of expressive attributes, with only modest trade-offs in text-motion alignment, demonstrating practical potential for expressive motion synthesis without additional training data.
Abstract
Diverse human motion generation is an increasingly important task with applications in computer vision, human-computer interaction, and animation. While text-to-motion synthesis using diffusion models has succeeded in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban Movement Analysis (LMA) has been widely used by dance experts to describe the details of motion, including motion quality, as consistently as possible. Inspired by this, our work aims for interpretable and expressive control of human motion generation by integrating the quantification methods of the Laban Effort and Shape components into text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model toward the desired Laban Effort and Shape components without any additional motion data by updating the text embedding of a pretrained diffusion model during sampling. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.
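The inference-time guidance described above can be illustrated with a deliberately small toy, which is a sketch under stated assumptions and not the authors' implementation. It replaces the pretrained diffusion model with a hypothetical frozen generator (`generate`), replaces the Laban features with a hypothetical energy proxy (`energy_feature`, mean squared velocity), and uses finite differences where the actual method would rely on differentiable features and backpropagation through the sampler. What it keeps is the overall shape of the idea: produce a baseline first, then update only the conditioning vector so that a feature of the output moves to a target value defined relative to that baseline.

```python
import math

def generate(embedding, frames=60):
    """Toy stand-in for a frozen generator (assumption, not a diffusion model):
    maps a 2-D conditioning vector to a 1-D joint trajectory."""
    amplitude, frequency = embedding
    return [amplitude * math.sin(2 * math.pi * frequency * t / frames)
            for t in range(frames)]

def energy_feature(motion):
    """Mean squared frame-to-frame velocity; a hypothetical proxy for an
    Effort-like quantity, not the paper's actual Laban feature."""
    velocities = [b - a for a, b in zip(motion, motion[1:])]
    return sum(v * v for v in velocities) / len(velocities)

def relative_loss(embedding, baseline_feature, target_ratio):
    """Squared error between the feature ratio (guided / baseline) and the target."""
    ratio = energy_feature(generate(embedding)) / baseline_feature
    return (ratio - target_ratio) ** 2

def numeric_grad(fn, embedding, eps=1e-5):
    """Central finite differences; a real system would use autograd instead."""
    grad = []
    for i in range(len(embedding)):
        up, down = list(embedding), list(embedding)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad

def guide(embedding, target_ratio, steps=200, lr=0.01):
    """Step 1: measure the baseline. Step 2: refine only the embedding."""
    baseline_feature = energy_feature(generate(embedding))
    current = list(embedding)
    loss = lambda e: relative_loss(e, baseline_feature, target_ratio)
    for _ in range(steps):
        grad = numeric_grad(loss, current)
        current = [c - lr * g for c, g in zip(current, grad)]
    return current, baseline_feature

if __name__ == "__main__":
    guided, baseline = guide([1.0, 1.0], target_ratio=2.0)
    print(guided, energy_feature(generate(guided)) / baseline)
```

The generator's parameters are never touched; only the conditioning vector moves, which is what makes this kind of control zero-shot. The toy has no term for preserving motion identity, so nothing here constrains how far the embedding drifts from its starting point.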
