CASIM: Composite Aware Semantic Injection for Text to Motion Generation

Che-Jui Chang, Qingze Tony Liu, Honglu Zhou, Vladimir Pavlovic, Mubbasir Kapadia

TL;DR

This work tackles the challenge of conditioning text-to-motion generation on composite, temporally structured prompts by moving beyond fixed-length CLIP embeddings. It introduces CASIM, a Composite Aware Semantic Injection Mechanism comprising a Composite Aware Text Encoder and a Text-Motion Aligner that learns dynamic, token-level alignments between text tokens and motion frames, and that is compatible with both autoregressive and diffusion-based models. Empirically, CASIM yields consistent gains in text-motion matching metrics and competitive motion quality on the HumanML3D and KIT benchmarks across multiple state-of-the-art baselines, and it shows promise for long-term motion generation. The approach offers interpretable attention patterns and stronger generalization to unseen prompts, enabling more precise and controllable text-driven motion synthesis.

Abstract

Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, which primarily rely on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model- and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on the HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
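The contrast between fixed-length injection and token-level alignment can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attend`, `mean_pool`), the toy 2-D embeddings, and the use of mean pooling as a stand-in for a single global text embedding are all assumptions made for illustration. The sketch only shows the general idea of letting each motion frame attend over every word token via scaled dot-product attention.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(motion_queries, text_keys, text_values):
    """Each motion frame (query) attends over all text tokens (keys/values)."""
    scale = math.sqrt(len(text_keys[0]))
    dim = len(text_values[0])
    outputs, weights = [], []
    for q in motion_queries:
        w = softmax([dot(q, k) / scale for k in text_keys])
        out = [sum(wi * v[d] for wi, v in zip(w, text_values)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

def mean_pool(text_values):
    """Stand-in for a fixed-length embedding: one vector for the whole prompt."""
    n = len(text_values)
    return [sum(v[d] for v in text_values) / n for d in range(len(text_values[0]))]

# Hypothetical toy embeddings (assumed values, not from the paper).
tokens = ["raise", "left", "hand"]
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motion_frames = [[4.0, 0.0], [0.0, 4.0]]  # two motion-frame queries

per_frame, attn = cross_attend(motion_frames, text_emb, text_emb)
global_vec = mean_pool(text_emb)  # identical conditioning for every frame
```

In this toy setup, the first frame places more weight on "raise" than on "left", while the second frame does the opposite, so the two frames receive different text conditioning. The pooled vector, by contrast, is the same for every frame, which is the limitation the abstract attributes to fixed-length injection.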

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (Top) Fixed-length semantic injection, which primarily relies on the [CLS] token embedding from CLIP (Radford et al., 2021) to represent the entire text prompt, fails to capture the subtle differences between individual words. As a result, it generates highly similar motions from distinct text prompts. (Bottom) Our composite-aware semantic injection method allows each motion frame to dynamically attend to every word token (e.g., "left" or "right" hand), enhancing the motion-text correspondence.
  • Figure 2: CASIM consists of two major components: Composite Aware Text Encoder (Left) for extracting granular word-level embeddings and Text-Motion Aligner (Middle) for aligning motion embeddings with relevant textual embeddings inside a motion generator. The attention score distribution between different motion tokens and the text tokens is visualized on the upper left. The Text-Motion Aligner can be integrated with three genres of motion generation models (Right).
  • Figure 3: Qualitative comparison between two baselines, their CASIM-enhanced models, and ground truth (GT) on HumanML3D test prompts. Action verbs and their modifiers are highlighted in red, with motion sequences shown in color gradients (light to dark) and root trajectories in black. CASIM-MDM and CASIM-T2MGPT generate motions that better match the descriptions, showing stronger text-motion correspondence and better controllability.
  • Figure 4: Analysis of attention patterns in CASIM. Left: Word cloud showing top-5 attended words across all test prompts, highlighting focus on action verbs, motion modifiers, and spatial references. Right: Word cloud for prompts containing 'walk', revealing attention to motion-specific contextual attributes.
  • Figure 5: Visualization of attention weights in CASIM-MDM. Top: Generated motion sequence for the prompt "a person wave his arms and then sit down". Bottom: Attention heatmaps for four attention heads and their average from the last layer.
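The attention analyses described in Figures 4 and 5 (per-head heatmaps, their average, and the most-attended words) can be approximated with a short sketch. The exact procedure used to produce the figures is not stated here, so the helper names (`average_heads`, `top_k_tokens`), the ranking by mean attention mass over frames, and the weights below are all hypothetical.

```python
def average_heads(attn):
    """attn[h][f][t] -> avg[f][t]: mean attention over heads."""
    n_heads = len(attn)
    n_frames = len(attn[0])
    n_tokens = len(attn[0][0])
    return [
        [sum(attn[h][f][t] for h in range(n_heads)) / n_heads
         for t in range(n_tokens)]
        for f in range(n_frames)
    ]

def top_k_tokens(avg, tokens, k):
    """Rank tokens by mean attention received across all motion frames."""
    n_frames = len(avg)
    mass = [sum(avg[f][t] for f in range(n_frames)) / n_frames
            for t in range(len(tokens))]
    ranked = sorted(range(len(tokens)), key=lambda t: mass[t], reverse=True)
    return [tokens[t] for t in ranked[:k]]

# Hypothetical weights: 2 heads x 2 motion frames x 6 tokens (rows sum to 1).
tokens = ["person", "wave", "arms", "then", "sit", "down"]
attn = [
    [[0.05, 0.50, 0.30, 0.05, 0.05, 0.05],   # head 0, early frame
     [0.05, 0.05, 0.05, 0.15, 0.50, 0.20]],  # head 0, late frame
    [[0.10, 0.40, 0.30, 0.10, 0.05, 0.05],   # head 1, early frame
     [0.05, 0.05, 0.05, 0.05, 0.60, 0.20]],  # head 1, late frame
]

avg = average_heads(attn)            # one heatmap: frames x tokens
top = top_k_tokens(avg, tokens, 3)   # most-attended words overall
```

With these made-up weights, the early frame concentrates on "wave" and "arms" and the late frame on "sit", which mirrors the kind of temporal pattern a heatmap like Figure 5 is meant to expose.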