Exploring Motion-Language Alignment for Text-driven Motion Generation

Ruxi Gu, Zilei Wang, Wei Wang

Abstract

Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns while establishing detailed alignment between text and motion. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and degrading semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.
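The abstract describes SinkRatio as a measure of how much attention mass collapses onto the start text token. The paper does not define the metric here, but a minimal sketch, assuming SinkRatio is simply the fraction of total cross-attention weight assigned to the first text token, could look like this (the function name, shapes, and toy logits are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def sink_ratio(attn: np.ndarray) -> float:
    """Fraction of total attention mass on the first (start) text token.

    attn: array of shape (num_queries, num_text_tokens) whose rows are
    softmax-normalized cross-attention weights (e.g. from motion frames
    to text tokens). Hypothetical definition for illustration only.
    """
    return float(attn[:, 0].sum() / attn.sum())

# Toy example: 4 motion-frame queries over 5 text tokens, with most of
# the logit mass on the start token -- the "attention sink" pattern.
logits = np.array([
    [4.0, 0.5, 0.2, 0.1, 0.0],
    [3.5, 0.3, 0.6, 0.2, 0.1],
    [4.2, 0.1, 0.4, 0.3, 0.2],
    [3.8, 0.4, 0.2, 0.5, 0.1],
])
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(round(sink_ratio(attn), 3))  # close to 1.0 indicates a strong sink
```

Under this reading, uniform attention over N text tokens gives a SinkRatio of 1/N, while a value near 1 signals the degenerate concentration the paper's sink-mask and sink-ctrl mechanisms are designed to mitigate.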

Figures (8)

  • Figure 1: Failure cases from a previous text-to-motion generation framework [meng2025absolute], which captures global motion patterns but often overlooks fine-grained motion details. The color gradient from dark to light represents the temporal progression of motion from earlier to later stages.
  • Figure 2: Overview of our MLA-Gen framework. It comprises three complementary components: Memory Slots for capturing global motion priors, Motion-Language Alignment for providing fine-grained textual semantics, and a SinkRatio-based mechanism that models and mitigates the attention sink phenomenon during both attention computation (sink-mask) and sampling (sink-ctrl).
  • Figure 3: Heatmap of the memory slots activation. Regions rendered in brighter yellow indicate higher attention weights between the corresponding motion frames and memory slots.
  • Figure 4: Heatmap of motion-language alignment. Regions rendered in brighter yellow indicate higher attention weights between the corresponding motion frames and text tokens.
  • Figure 5: Comparison of alignment heatmaps for the masked model (left) and the unmasked model (right). The textual descriptions and timesteps are kept consistent.
  • ...and 3 more figures