Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao

TL;DR

This work proposes an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image compatible with pre-trained Vision Transformers. For retrieval, it employs MaxSim, a token-wise late interaction mechanism, enhanced with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment.

Abstract

Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
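
The token-wise late interaction described above can be illustrated with a minimal sketch. It assumes a ColBERT-style MaxSim: each text token is matched to its most similar motion patch by cosine similarity, and the per-token maxima are summed. The function name `maxsim_score`, the use of cosine similarity, and the sum (rather than a mean) are illustrative assumptions, not details confirmed by the abstract.

```python
import numpy as np


def maxsim_score(text_tokens, motion_patches, eps=1e-12):
    """Hypothetical MaxSim sketch (not the authors' implementation).

    text_tokens:    (num_tokens, dim) text token embeddings
    motion_patches: (num_patches, dim) motion patch embeddings
    Returns the scalar score and the (num_tokens, num_patches) similarity map.
    """
    t = np.asarray(text_tokens, dtype=np.float64)
    p = np.asarray(motion_patches, dtype=np.float64)
    # L2-normalize so the dot product is a cosine similarity (assumption).
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    sim = t @ p.T  # token-patch similarity map
    # For each token keep its best-matching patch, then sum over tokens.
    score = float(sim.max(axis=1).sum())
    return score, sim


# Toy example: 2 text tokens, 3 motion patches, 2-dimensional embeddings.
toy_text = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_motion = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
toy_score, toy_map = maxsim_score(toy_text, toy_motion)
```

Unlike a single global embedding, the similarity map keeps one value per token-patch pair, which is what makes it possible to inspect which part of a motion a given word aligned with.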

Paper Structure

This paper contains 28 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the three-stage training pipeline.
  • Figure 2: Joint-angle vs. joint-position-based representations for "a person walks slowly forward". (a) Skeletal structure and body-centric axes. (b) Right hip angles. (c) Right knee positions. (d) Our 29-dimensional joint-angle Motion Image: each band encodes a distinct joint. (e) MoPatch (Yu et al., 2024) position image.
  • Figure 3: Qualitative T2M retrieval top-3 results on HumanML3D. Correct retrievals (ground-truth match) are highlighted in green.
  • Figure 4: MaxSim interaction score maps for two text-motion pairs. Left: 3D motion. Middle: normalized Motion Image. Right: interaction score map (brighter = stronger alignment).
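
The banded Motion Image mentioned in the Figure 2 and Figure 4 captions can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is a `(frames, channels)` array of joint angles, each channel is min-max normalized independently, and each channel is drawn as a horizontal band of `band_height` pixel rows with time along the horizontal axis. The function name, the band height, and the normalization scheme are hypothetical; the captions do not specify them.

```python
import numpy as np


def angles_to_motion_image(angles, band_height=4):
    """Hypothetical sketch of a banded joint-angle pseudo-image.

    angles: (num_frames, num_channels) joint-angle time series
    Returns an array of shape (num_channels * band_height, num_frames)
    with values in [0, 1].
    """
    angles = np.asarray(angles, dtype=np.float64)
    lo = angles.min(axis=0, keepdims=True)
    hi = angles.max(axis=0, keepdims=True)
    # Avoid division by zero for channels that never change.
    span = np.where(hi > lo, hi - lo, 1.0)
    normalized = (angles - lo) / span  # (frames, channels), per-channel [0, 1]
    # One band per channel: repeat each channel row band_height times.
    return np.repeat(normalized.T, band_height, axis=0)


# Toy example: 60 frames of 29 synthetic angle channels (assumed sizes).
rng = np.random.default_rng(0)
toy_angles = rng.uniform(-np.pi, np.pi, size=(60, 29))
toy_image = angles_to_motion_image(toy_angles, band_height=4)
```

A layout like this keeps each row band tied to one angle channel, so a patch in the image can be traced back to a specific joint and time window.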