CASIM: Composite Aware Semantic Injection for Text to Motion Generation
Che-Jui Chang, Qingze Tony Liu, Honglu Zhou, Vladimir Pavlovic, Mubbasir Kapadia
TL;DR
This work tackles the challenge of conditioning text-to-motion generation on composite, temporally structured prompts by moving beyond fixed-length CLIP embeddings. It introduces CASIM, a Composite Aware Semantic Injection Mechanism comprising a Composite Aware Text Encoder and a Text-Motion Aligner that learns dynamic, token-level alignments between text tokens and motion frames; the mechanism is compatible with both autoregressive and diffusion-based models. Empirically, CASIM yields consistent gains in text-motion matching metrics and competitive motion quality on the HumanML3D and KIT benchmarks across multiple state-of-the-art baselines, and it shows promise for long-term motion generation. The approach also offers interpretable attention patterns and stronger generalization to unseen prompts, enabling more precise and controllable text-driven motion synthesis.
Abstract
Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, which primarily rely on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model- and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on the HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
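To make the contrast between global and token-level semantic injection concrete, the following is a minimal NumPy sketch, not the CASIM implementation. It assumes a generic single-head cross-attention as a stand-in for a text-motion aligner and mean pooling as a stand-in for a fixed-length embedding; the function names, dimensions, and random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_injection(motion, text_tokens):
    # Fixed-length conditioning (assumed stand-in): pool the text into one
    # vector and add the same vector to every motion frame.
    pooled = text_tokens.mean(axis=0)           # shape (d,)
    return motion + pooled                      # broadcast over T frames

def token_level_injection(motion, text_tokens, w_q, w_k, w_v):
    # Generic cross-attention (assumed stand-in for an aligner): each motion
    # frame queries the individual text tokens and gets its own mixture.
    q = motion @ w_q                            # shape (T, d)
    k = text_tokens @ w_k                       # shape (N, d)
    v = text_tokens @ w_v                       # shape (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # shape (T, N)
    return motion + attn @ v, attn

# Hypothetical sizes: T motion frames, N text tokens, feature width d.
rng = np.random.default_rng(0)
T, N, d = 6, 4, 8
motion = rng.normal(size=(T, d))
text_tokens = rng.normal(size=(N, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused_global = global_injection(motion, text_tokens)
fused_token, attn = token_level_injection(motion, text_tokens, w_q, w_k, w_v)
```

In this sketch, `attn` has one row per motion frame and one column per text token, so inspecting its rows shows which words each frame draws on. The global variant applies an identical offset to every frame, which is the limitation the abstract attributes to fixed-length embeddings.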
