A Cross-Dataset Study for Text-based 3D Human Motion Retrieval

Léore Bensabath; Mathis Petrovich; Gül Varol

TL;DR

This work analyzes cross-dataset generalization in text-based 3D human motion retrieval using a unified SMPL representation to enable training across HumanML3D, KITML, and BABEL. It extends the TMR framework with text augmentations via paraphrasing and action-style prompts, plus a hard-negative contrastive loss, and studies multi-dataset training. The results show dataset biases persist across benchmarks, text augmentation reduces the domain gap but does not fully close it, and zero-shot action recognition on BABEL improves substantially when trained with augmented HumanML3D text. The findings highlight the potential and limitations of language-driven robustness in 3D motion retrieval and suggest directions for grounded, motion-aware augmentation and broader cross-domain analyses.

Abstract

We present a study of text-based 3D human motion retrieval, with a particular focus on cross-dataset generalization. For practical reasons such as dataset-specific human body representations, existing works typically benchmark by training and testing on partitions of the same dataset. Here, we employ a unified SMPL body format for all datasets, which allows us to train on one dataset and test on another, as well as to train on a combination of datasets. Our results suggest that dataset biases exist in standard text-motion benchmarks such as HumanML3D, KIT Motion-Language, and BABEL. We show that text augmentations help close the domain gap to some extent, but the gap remains. We further provide the first zero-shot action recognition results on BABEL, without using categorical action labels during training, opening up a new avenue for future research.
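The zero-shot action recognition setting can be pictured as motion-to-text retrieval in which each class name is treated as a text query, as the Figure 4 caption describes. The sketch below is illustrative only: it assumes embeddings are already computed and compared with cosine similarity, and the function names `cosine` and `classify_zero_shot` and the toy vectors are hypothetical, not taken from the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_zero_shot(motion_emb, class_text_embs, k=5):
    """Rank class names by similarity between a motion embedding and
    the text embedding of each class label; return the top-k (name, score) pairs."""
    scores = [(name, cosine(motion_emb, emb)) for name, emb in class_text_embs.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 2-D embeddings standing in for encoded class labels (hypothetical values).
class_text_embs = {"walk": [1.0, 0.0], "jump": [0.0, 1.0], "run": [0.9, 0.1]}
top = classify_zero_shot([1.0, 0.0], class_text_embs, k=2)
```

The top-ranked label serves as the predicted action, so no categorical labels are needed at training time; only the label text is encoded at test time.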

Paper Structure (19 sections, 6 figures, 5 tables)

Introduction
Related Work
3D human motions and language.
Zero-shot classification with natural language supervision.
Methodology
Experiments
Datasets
Evaluation protocol
Text-to-motion retrieval results
Cross-dataset evaluations.
Combining datasets.
Text augmentation.
HN-NCE.
Zero-shot action recognition results
Text augmentation ablations
...and 4 more sections

Figures (6)

Figure 1: 3D human motion descriptions per dataset: The t-SNE plot of text embeddings corresponding to motion descriptions clearly shows a domain gap between the concise raw labels of the BABEL dataset and the full-sentence labels of HumanML3D and KITML datasets.
Figure 2: Model overview: We employ TMR (Petrovich et al., 2023) for text-motion retrieval and combine several text augmentation approaches to increase its robustness across domains. For each ground truth (GT) textual label, we generate $n$ paraphrased versions, as well as a short action-style description, by prompting Llama-2. During training, we randomly sample one of these labels with probabilities defined by $p_{gt}, p_{par}, p_{avg}, p_{act}$. With probability $p_{avg}$, we randomly subsample from all versions and average their text embeddings. The selected text embedding $z^T$ is then matched to the motion embedding $z^M$ using a contrastive loss. Note that we do not visualize the motion decoder for simplicity, but we keep the original architecture as in TMR.
Figure 3: Qualitative results on HumanML3D text-to-motion retrieval with and without augmentation: In both examples, none of the retrieved motions are far from the text description, but the model trained with augmentation captures more of the requested details for most motions in the top 5 ranks. In the top example, the model captures the interaction between elbow and knee, while the baseline model only captures the involvement of the legs. In the bottom example, the model retrieves both parts of the movement (putting the box down and running), while the baseline only retrieves the running portion.
Figure 4: Qualitative results on BABEL action recognition: We apply zero-shot action classification via motion-to-text retrieval by treating class labels as text. The model is trained on HumanML3D free-form textual labels, and tested on BABEL actions. On the right of each input motion example, we display the ground truth (GT) action, along with the top-5 retrieved actions and their motion-text similarity scores. We observe that the high similarities among the top retrieved actions are mainly due to ambiguities across categories, e.g., "Grasp object" motion retrieves action classes involving hand motions such as "Touch object" and "Hand movements".
Figure 5: Per-action performance improvement: We plot the per-action R@1 scores for the 60 BABEL actions, comparing with/without the text augmentations. The dashed line represents the frequency of test labels for each class (y-axis on the right), showing the unbalanced nature of this benchmark.
...and 1 more figures
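The label sampling described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_text_embedding` is hypothetical, embeddings are plain lists of floats, and the choice to draw a uniformly random subset size from all versions (GT, action-style, paraphrases) in the averaging branch is an assumption, since the caption does not specify it.

```python
import random


def sample_text_embedding(gt, paraphrases, action, probs, rng):
    """Pick the text embedding used for one training step.

    gt, action: embedding vectors; paraphrases: list of embedding vectors.
    probs: dict with keys 'gt', 'par', 'avg', 'act' (p_gt, p_par, p_avg, p_act).
    Returns (mode, embedding).
    """
    modes = ["gt", "par", "avg", "act"]
    mode = rng.choices(modes, weights=[probs[m] for m in modes], k=1)[0]
    if mode == "gt":
        return mode, list(gt)
    if mode == "par":
        return mode, list(rng.choice(paraphrases))
    if mode == "act":
        return mode, list(action)
    # 'avg': subsample from all versions and average their embeddings.
    # Assumption: subset size is uniform in [1, number of versions].
    versions = [gt, action] + list(paraphrases)
    k = rng.randint(1, len(versions))
    subset = rng.sample(versions, k)
    return mode, [sum(v[i] for v in subset) / k for i in range(len(gt))]


# Toy usage with hypothetical 2-D embeddings and probabilities.
rng = random.Random(0)
gt = [1.0, 0.0]
paraphrases = [[0.0, 1.0], [1.0, 1.0]]
action = [0.5, 0.5]
probs = {"gt": 0.4, "par": 0.2, "avg": 0.2, "act": 0.2}
mode, z_text = sample_text_embedding(gt, paraphrases, action, probs, rng)
```

The returned embedding plays the role of $z^T$ and would then be matched against $z^M$ by the contrastive loss.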
