Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Nicola Messina, Jan Sedmidubsky, Fabrizio Falchi, Tomáš Rebok

TL;DR

This work tackles text-to-motion and motion-to-text retrieval under data scarcity by proposing joint-dataset learning and a Cross-Consistent Contrastive Loss (CCCL) to regularize a shared cross-modal space. It introduces MoT++, a transformer-based motion encoder with spatio-temporal attention and structured joint tokens, and uses a variational-like latent space to align text and motion modalities. Through extensive experiments on KITML and HumanML3D, the approach achieves state-of-the-art performance in both single-dataset and cross-dataset settings, demonstrating improved generalization and robustness. The contributions offer practical improvements for scalable retrieval in skeleton-based motion datasets and point to broader applicability in low-data cross-modal retrieval tasks, with potential extensions to additional modalities and pairwise learning.

Abstract

Pose-estimation methods make it possible to extract human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for the database motions most relevant to a specified natural-language textual description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously, together with a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data. We demonstrate the benefits of the proposed approaches on the widely used KIT Motion-Language and HumanML3D datasets. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods.
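
The abstract does not spell out the form of CCCL, so the following is only a minimal sketch of a generic symmetric cross-modal contrastive (InfoNCE-style) objective, the kind of text-motion alignment term that uni-modal constraints could be added to. It is not the paper's loss. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(texts, motions):
    """Cosine similarity between every text embedding and every motion embedding."""
    t = [l2_normalize(v) for v in texts]
    m = [l2_normalize(v) for v in motions]
    return [[sum(a * b for a, b in zip(ti, mj)) for mj in m] for ti in t]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for numerical stability
        log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(texts, motions, temperature=0.1):
    """Average of the text-to-motion and motion-to-text contrastive terms.

    Assumes texts[i] and motions[i] form a matching pair (hypothetical setup).
    """
    sim = similarity_matrix(texts, motions)
    logits_t2m = [[s / temperature for s in row] for row in sim]
    logits_m2t = [list(col) for col in zip(*logits_t2m)]  # transpose
    return 0.5 * (cross_entropy_diag(logits_t2m) + cross_entropy_diag(logits_m2t))

# Toy 2-D embeddings (illustrative values, not data from the paper).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_motion_embs = [[1.0, 0.0], [0.0, 1.0]]
swapped_motion_embs = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(text_embs, aligned_motion_embs)
loss_swapped = symmetric_info_nce(text_embs, swapped_motion_embs)
```

In this toy setting the loss is close to zero when each text embedding coincides with its paired motion embedding and grows when the pairs are swapped, which is the behavior a cross-modal alignment term is expected to have.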

Paper Structure (23 sections, 10 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Formulation of the tasks: text-to-motion retrieval (left) and motion-to-text retrieval (right).
  • Figure 2: Schematic illustration of the whole architecture. In the training phase, (1) joint-dataset learning is applied to the unified HumanML3D and KITML datasets to learn the common space of the text and motion modalities, and (2) the trained motion encoder is then used to extract motion embeddings that are stored in a database (exemplified on the KITML dataset). In the retrieval phase, the embedding of a given text query is extracted and compared against the embeddings of motions in the KITML database to retrieve the $K$ most relevant motions.
  • Figure 3: MoT++ architecture. The input spatio-temporal skeleton sequence $\bar{\mathbf{x}}$ is processed by the $\mathcal{J}$ function to spatially group the skeleton joints and therefore reduce the spatial sequence length while increasing the dimensionality from $D$ to $D'$. The resulting sequence $\mathbf{x}$, concatenated with two special tokens, is then processed by a spatio-temporal transformer, in this case configured in a factorized self-attention setup. The two output CLS tokens are used as $\mathbf{m}^\mu$ and $\mathbf{m}^{\sigma^2}$.
  • Figure 4: Distributions of ranks ($x$-axis) of relevant objects retrieved for (a) text-to-motion and (b) motion-to-text scenarios using joint-dataset learning: training on KITML+HumanML3D, testing on HumanML3D.
  • Figure 5: Qualitative examples on text-to-motion retrieval. Samples (a) and (b) show success cases in which MoT++ finds the GT motion that best matches the text query within the first six results (highlighted with a green border), while TMR ranks it lower. Sample (c), instead, shows a failure case in which MoT++ cannot find the GT motion because it misses an important discriminative attribute ("stiffly" in this case).
  • ...and 1 more figure
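
The retrieval phase described in the Figure 2 caption (embed the text query, compare it against the stored motion embeddings, return the $K$ most relevant motions) and the rank of a relevant item, as plotted in Figure 4, can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity as the comparison function, a plain dictionary as the motion database, and hand-written toy embeddings. None of these are confirmed by the captions above, and all names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query_emb, motion_db, k):
    """Return the ids of the k motions most similar to the query, best first."""
    scored = sorted(
        motion_db.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [motion_id for motion_id, _ in scored[:k]]

def rank_of(query_emb, motion_db, gt_id):
    """1-based position of the ground-truth motion in the full ranking."""
    ranking = retrieve_top_k(query_emb, motion_db, len(motion_db))
    return ranking.index(gt_id) + 1

# Toy database of precomputed motion embeddings (illustrative values only).
motion_db = {
    "walk": [1.0, 0.0, 0.0],
    "jump": [0.0, 1.0, 0.0],
    "wave": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of a text query such as "a person walks forward".
query_emb = [0.9, 0.1, 0.0]

top2 = retrieve_top_k(query_emb, motion_db, k=2)   # ["walk", "jump"]
gt_rank = rank_of(query_emb, motion_db, "walk")    # 1
```

Motion-to-text retrieval would follow the same pattern with the roles swapped: a motion embedding as the query and a database of text embeddings.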