Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian

TL;DR

LAVIMO presents a tri-modal framework that jointly embeds text, video, and motion into a shared space using cross-modal contrastive learning. By introducing video as an intermediary modality and a novel attention-based fusion in motion reconstruction, it achieves state-of-the-art text-to-motion and video-to-motion retrieval on HumanML3D and KIT-ML. Key contributions include the tri-modal architecture, a negative-filtered alignment objective, and RGB-video augmentation of motion datasets to enrich training data. The approach enables flexible cross-modal retrieval, offers a new video-to-motion retrieval task, and demonstrates generalization to real-life content, advancing practical human-motion understanding and animation pipelines.

Abstract

Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data, especially for online acquisition, has led to a surge in human motion research. Prior work has mainly concentrated on dual-modality learning, such as text and motion tasks, while three-modality learning has rarely been explored. Intuitively, an additional modality can broaden a model's application scenarios and, more importantly, a well-chosen additional modality can also act as an intermediary that enhances the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel framework for three-modality learning that integrates human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Moreover, our approach leverages a specially designed attention mechanism to foster enhanced alignment and synergistic effects among the text, video, and motion modalities. Empirically, our results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance in various motion-related cross-modal retrieval tasks, including text-to-motion, motion-to-text, video-to-motion, and motion-to-video.
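
The joint embedding idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a plain symmetric InfoNCE loss summed over the three modality pairs, and it omits the negative filtering mentioned in the TL;DR. The function names (`info_nce`, `tri_modal_loss`, `retrieve`), the temperature value, and the choice of pairs are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE for two batches of paired embeddings, each (N, D).

    Row i of `a` and row i of `b` are treated as the positive pair; every
    other row in the batch is a negative (no filtering is applied here).
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (N, N) cosine similarities
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_loss(text, video, motion, temperature=0.1):
    """Sum of pairwise losses; which pairs to include is an assumption."""
    return (info_nce(text, motion, temperature)
            + info_nce(video, motion, temperature)
            + info_nce(text, video, temperature))

def retrieve(query, gallery, k=3):
    """Indices of the k gallery embeddings most similar to one query."""
    sims = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-sims)[:k]
```

Because all three modalities share one space, the same `retrieve` call would serve text-to-motion and video-to-motion queries alike: only the encoder that produced `query` changes.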

Paper Structure

This paper contains 12 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of LAVIMO. During the training phase, the three modalities are processed through their distinct encoders. Subsequently, the resultant embeddings are aligned within a unified joint embedding space utilizing contrastive learning techniques. In the inference stage, the model is capable of accepting texts or videos as input queries, enabling the retrieval of corresponding motion data effectively.
  • Figure 2: Overview of Features Fusion module. The embeddings for the text, video, and motion modalities are derived from their respective encoders. Subsequently, the motion embedding acts as a query to retrieve relevant information from the text and video, potentially compensating for any information that may be missing in the motion modality. The output of the attention mechanism is the weighted synthesis of the three modalities, which is then fed to the motion decoder for reconstruction.
  • Figure 3: Qualitative Comparison on the HumanML3D Dataset. Our method successfully performs text-to-motion and video-to-motion retrieval tasks. For text-to-motion retrieval, we compare our results with TMR (Petrovich et al., 2023). In the first row, using a random text from the test set, our method accurately retrieves the correct motion at rank 1, with similar motions such as "boxing" at ranks 2 and 3, resembling "karate type motion". In contrast, TMR struggles, with only its rank 3 motion matching the ground truth. In the second row, when testing with a non-test-set text involving "dance", our model retrieves motions suggesting a "Latin dance", which is more accurate than TMR's less precise dance motions. For video-to-motion retrieval, in the third row, our model excels with test set videos, correctly retrieving ground-truth motions at rank 1 and similar motions at ranks 2 and 3. Furthermore, in the last row, when applied to real-life human-centric videos, our model shows strong generalization, retrieving motions closely matching the video content, such as "leg swinging" and "standing up and walking".
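
The fusion step described in the Figure 2 caption can be sketched as single-head scaled dot-product attention, with the motion embedding as the query. This is an illustrative guess at the mechanism, not the paper's architecture. It assumes one embedding vector per modality, that keys and values come from all three modalities (so the output is a weighted mix of the three), and that the random matrices `w_q`, `w_k`, `w_v` stand in for learned projections. The function name `fuse` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(text, video, motion, w_q, w_k, w_v):
    """Fuse three (N, D) embeddings, using motion as the attention query.

    Returns the fused embedding (N, D) and the attention weights (N, 3),
    ordered as (text, video, motion).
    """
    tokens = np.stack([text, video, motion], axis=1)   # (N, 3, D)
    q = motion @ w_q                                   # (N, D)
    k = tokens @ w_k                                   # (N, 3, D)
    v = tokens @ w_v                                   # (N, 3, D)
    # Similarity of each motion query to its own three modality tokens.
    scores = np.einsum("nd,nmd->nm", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    fused = np.einsum("nm,nmd->nd", weights, v)        # weighted synthesis
    return fused, weights

# Example with random stand-ins for encoder outputs and learned weights.
rng = np.random.default_rng(0)
n, d = 4, 8
text, video, motion = (rng.normal(size=(n, d)) for _ in range(3))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused, weights = fuse(text, video, motion, w_q, w_k, w_v)
```

In a setup like the one the caption describes, `fused` would then be passed to the motion decoder for reconstruction. The per-sample `weights` show how much each modality contributed.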