Cross-Modal Retrieval for Motion and Text via DropTriple Loss

Sheng Yan; Yang Liu; Haoqiang Wang; Xin Du; Mengyuan Liu; Hong Liu

TL;DR

This work tackles cross-modal retrieval between 3D human motion and natural language by introducing a compact dual-unimodal transformer framework and a novel DropTriple Loss. DropTriple Loss first prunes false negatives (samples with high intra-modal similarity to the positive) and then mines genuinely hard negatives from what remains, which addresses the semantic conflicts that arise when different motion sequences share atomic actions. Empirical results on HumanML3D and KIT-ML show consistent improvements over the sum-of-hinges (SH) and max-of-hinges (MH) triplet losses in R@1, R@5, and R@10, with further gains when the language model is fine-tuned. The approach enables more reliable motion-text search, is relevant to applications such as surveillance and action description retrieval, and could potentially extend to other cross-modal domains.
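For readers unfamiliar with the R@1, R@5, and R@10 figures mentioned above, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-gallery similarity matrix. It assumes exactly one ground-truth item per query, stored at the same index; the function name `recall_at_k` and the toy matrix are illustrative and not taken from the paper, whose evaluation protocol may differ (for example, when a motion has several captions).

```python
def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k.

    sim[i][j] is the similarity between query i and gallery item j.
    Assumption: the only correct match for query i is gallery item i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of gallery items scored strictly higher than the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (illustrative values only).
sim = [
    [0.9, 0.1, 0.3],  # true match ranked first
    [0.8, 0.5, 0.2],  # true match ranked second
    [0.1, 0.7, 0.6],  # true match ranked second
]
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

In this toy case only the first query retrieves its match at rank one, while all three succeed within the top two.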

Abstract

Cross-modal retrieval of image-text and video-text is a prominent research area in computer vision and natural language processing. However, cross-modal retrieval between human motion and text has received insufficient attention, despite its wide-ranging applicability. To address this gap, we use a concise yet effective dual-unimodal transformer encoder for this task. Recognizing that overlapping atomic actions in different human motion sequences can lead to semantic conflicts between samples, we explore a novel triplet loss function called DropTriple Loss. This loss function discards false negative samples from the negative sample set and mines the remaining, genuinely hard negative samples for triplet training, thereby reducing the violations that false negatives cause. We evaluate our model and approach on the HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval (both based on R@10). The source code for our approach is publicly available at https://github.com/eanson023/rehamot.
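The drop-then-mine idea described above can be sketched for one retrieval direction (text anchor, motion candidates) as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name, the `margin` and `threshold` defaults, and the rule "drop a negative when its motion-to-motion similarity to the positive reaches the threshold" are hypothetical choices made for the example. The released code at the URL above is the authoritative reference.

```python
def drop_triple_loss_one_direction(cross_sim, intra_sim, margin=0.2, threshold=0.8):
    """Hinge loss on the hardest negative that survives false-negative pruning.

    cross_sim[i][j]: similarity between text i and motion j (positive pair: j == i).
    intra_sim[i][j]: similarity between motion i and motion j.
    margin, threshold: hypothetical values chosen for illustration.
    """
    n = len(cross_sim)
    total = 0.0
    for i in range(n):
        positive = cross_sim[i][i]
        # Drop step: discard negatives that look too similar to the positive motion.
        kept = [
            cross_sim[i][j]
            for j in range(n)
            if j != i and intra_sim[i][j] < threshold
        ]
        if not kept:
            continue  # every negative was pruned; this anchor contributes nothing
        # Mine step: the hardest remaining negative is the one scored highest.
        hardest = max(kept)
        total += max(0.0, margin + hardest - positive)
    return total / n


# Toy batch: motions 0 and 1 are near-duplicates (intra-modal similarity 0.95).
cross_sim = [
    [0.6, 0.7, 0.3],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.9],
]
intra_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(drop_triple_loss_one_direction(cross_sim, intra_sim))
# Setting the threshold above every similarity disables pruning for comparison.
print(drop_triple_loss_one_direction(cross_sim, intra_sim, threshold=1.1))
```

In the toy batch, motion 1 would be mined as the hardest negative for text 0 if nothing were pruned, even though it closely resembles the positive; pruning removes that penalty. A full training objective would typically apply the same step in the motion-to-text direction as well.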

Paper Structure (16 sections, 7 equations, 10 figures, 3 tables)

Introduction
Method
Task Definition
Model Architecture
Learning Objective
SH Loss & MH Loss
False Negative Sample Definition
DropTriple Loss
Experiments
Datasets, and Evaluation Protocol
Pose Representation
Implementation Details
Results
Ablation Study on Warm-up and Threshold $\delta$
Qualitative Results
...and 1 more sections

Figures (10)

Figure 1: As an example of motion retrieval: Given a textual query (anchor), the retrieval model searches for positive motion sample (green box) in the motion library. Likewise, text retrieval follows a similar procedure.
Figure 2: Our proposed framework encodes and aggregates the motion and text inputs separately in their respective encoders. Finally, the outputs are mapped to the joint embedding space through a projection layer. Within the same training batch, the DropTriple Loss discards mining false-Negs $m_{j}$ and $m_{s}$, while pushing the genuinely hard-Neg $m_{k}$ away.
Figure 3: epoch-1
Figure 4: epoch-6
Figure 5: epoch-11
...and 5 more figures
