Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Kent Fujiwara, Mikihiro Tanaka, Qing Yu

TL;DR

This work identifies a gap in motion-language models: they often fail to capture the chronological order of events in a motion sequence. It introduces Chronologically Accurate Retrieval (CAR), a test that measures whether a model can distinguish a chronologically correct description from one whose events have been shuffled, and finds that state-of-the-art systems frequently cannot. The authors propose a simple training strategy that uses chronologically shuffled descriptions as hard negatives within a contrastive learning framework, which improves both text-motion retrieval and text-to-motion generation. Across multiple language encoders and generation baselines, the approach strengthens temporal alignment between language and motion, particularly for compound actions in datasets such as HumanML3D. The findings indicate that temporal structure should be incorporated into cross-modal training to obtain temporally coherent motion-language representations.

Abstract

With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance in terms of conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with a shuffled sequence of events as negative samples during training to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the necessity to consider temporal elements in motion-language alignment.
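
The CAR test described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the event decomposition (done by a large language model in the paper) is assumed to be given as a list of strings, the `", then "` joiner, the toy encoders, the helper names, and the rule that ties count as failures are all assumptions made for illustration, and the motion embedding is stood in for by the encoding of the correct text.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
JOINER = ", then "               # assumed way of joining events into one sentence
VOCAB = ["walk", "sit", "jump"]  # toy event vocabulary for the demo encoders


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shuffle_events(events: List[str], rng: random.Random) -> List[str]:
    """Return a reordering of the events that differs from the original order."""
    if len(set(events)) < 2:
        raise ValueError("need at least two distinct events to shuffle")
    shuffled = events[:]
    while shuffled == events:
        rng.shuffle(shuffled)
    return shuffled


def car_accuracy(
    samples: List[Tuple[Vector, List[str]]],
    encode_text: Callable[[str], Vector],
    rng: random.Random,
) -> float:
    """Fraction of motions scored closer to the original text than to the
    shuffled one. Ties count as failures (an assumption of this sketch)."""
    correct = 0
    for motion_vec, events in samples:
        original = JOINER.join(events)
        negative = JOINER.join(shuffle_events(events, rng))
        sim_original = cosine(motion_vec, encode_text(original))
        sim_negative = cosine(motion_vec, encode_text(negative))
        if sim_original > sim_negative:
            correct += 1
    return correct / len(samples)


def order_blind_encoder(text: str) -> List[float]:
    """Toy bag-of-events encoder: ignores event order entirely."""
    events = text.split(JOINER)
    return [float(events.count(word)) for word in VOCAB]


def order_aware_encoder(text: str) -> List[float]:
    """Toy encoder that weights each event by 1 / (position + 1)."""
    vec = [0.0] * len(VOCAB)
    for position, word in enumerate(text.split(JOINER)):
        vec[VOCAB.index(word)] += 1.0 / (position + 1)
    return vec


def make_samples(encoder: Callable[[str], Vector]) -> List[Tuple[Vector, List[str]]]:
    """Stand-in motion embeddings: the encoding of each correct description."""
    descriptions = [["walk", "sit"], ["jump", "walk", "sit"], ["sit", "jump"]]
    return [(encoder(JOINER.join(events)), events) for events in descriptions]
```

Calling `car_accuracy(make_samples(order_aware_encoder), order_aware_encoder, random.Random(0))` yields 1.0, while the same call with `order_blind_encoder` yields 0.0, because an encoder that ignores order assigns identical scores to the original and shuffled texts. This mirrors, in miniature, the failure mode CAR is designed to expose.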

Paper Structure

This paper contains 31 sections, 5 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Overview of the Chronologically Accurate Retrieval test. Given a motion sequence, motion-language models trained on text-motion datasets are asked to retrieve the more relevant text from the ground truth and its shuffled version. Original texts are decomposed into events by off-the-shelf large language models, and the events are then randomly shuffled.
  • Figure 2:
  • Figure 3:
  • Figure 5: Overview of the proposed contrastive learning scheme with chronological negative samples. Texts derived by shuffling the event order are used as negative text samples, corresponding to the items indicated in pink.
  • Figure 6: Comparison of retrieval results with corrupted texts using TMR and the proposed training scheme. Pink texts indicate the successfully retrieved ground truth text.
  • ...and 2 more figures
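
The contrastive scheme outlined in the Figure 5 caption can be sketched as an InfoNCE-style loss in which shuffled-event texts are appended to the pool of text candidates as extra negatives. This is a minimal illustration under stated assumptions, not the paper's loss: it covers only the motion-to-text direction, treats every shuffled text as a negative for every motion, and the function name, temperature value, and plain-list embeddings are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def motion_to_text_loss(
    motions: Sequence[Vector],
    texts: Sequence[Vector],
    shuffled_texts: Sequence[Vector],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Mean InfoNCE loss over motions.

    texts[i] is the positive for motions[i]. All other texts and every
    entry of shuffled_texts act as negatives (an assumption of this sketch).
    """
    unit_motions = [normalize(v) for v in motions]
    candidates = [normalize(v) for v in list(texts) + list(shuffled_texts)]
    total = 0.0
    for i, motion in enumerate(unit_motions):
        logits = [dot(motion, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(unit_motions)
```

In this sketch, a shuffled text whose embedding sits close to the ground-truth text raises the loss far more than an unrelated negative does, which is the pressure that pushes an encoder to separate descriptions that differ only in event order.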