Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

Yao Zhang; Zhuchenyang Liu; Yanlan He; Thomas Ploetz; Yu Xiao

TL;DR

This work proposes an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image compatible with pre-trained Vision Transformers. For retrieval, it employs MaxSim, a token-wise late interaction mechanism, enhanced with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment.

Abstract

Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
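
To make the late interaction idea concrete, the following is a minimal sketch of a MaxSim-style score between text token embeddings and motion patch embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `maxsim_score`, the use of cosine similarity, and the choice to take the maximum over motion patches and sum over text tokens are all assumptions that follow the common late-interaction formulation, and the toy embeddings are made up.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(text_tokens, motion_patches):
    """Hypothetical MaxSim-style score (assumed formulation).

    For every text token, find the motion patch with the highest cosine
    similarity, then sum those per-token maxima. Also returns the index of
    the best-matching patch for each token, which is the kind of
    token-to-patch correspondence that can be inspected for interpretability.
    """
    text = [l2_normalize(t) for t in text_tokens]
    patches = [l2_normalize(p) for p in motion_patches]

    score = 0.0
    best_patches = []
    for t in text:
        # Cosine similarity of this token against every patch.
        sims = [sum(a * b for a, b in zip(t, p)) for p in patches]
        best = max(range(len(sims)), key=sims.__getitem__)
        best_patches.append(best)
        score += sims[best]
    return score, best_patches


# Toy example: 2 text tokens and 3 motion patches in a 2-D embedding space.
text_tokens = [[1.0, 0.0], [0.0, 2.0]]
motion_patches = [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
score, best_patches = maxsim_score(text_tokens, motion_patches)
print(score, best_patches)
```

Unlike a single global embedding per modality, this kind of score keeps one similarity value per token-patch pair, so the per-token argmax can be read off directly as a local text-motion correspondence.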

Paper Structure (28 sections, 9 equations, 4 figures, 4 tables)

Introduction
Background and Related Work
Human Motion Representations
Text-Motion Retrieval
Late Interaction in Retrieval
Methodology
Joint-angle-based Motion Representation
Joint angle extraction.
Motion Image construction.
Dual-Stream Architecture
Motion Encoder.
Text Encoder.
Fine-Grained Late Interaction (MaxSim)
Context-Aware Regularization via MLM
Training Strategy and Loss Function
...and 13 more sections

Figures (4)

Figure 1: Overview of the three-stage training pipeline.
Figure 2: Joint-angle vs. joint-position-based representations for "a person walks slowly forward". (a) Skeletal structure and body-centric axes. (b) Right hip angles. (c) Right knee positions. (d) Our 29-dimensional joint-angle Motion Image: each band encodes a distinct joint. (e) MoPatch (yu2024exploring) position image.
Figure 3: Qualitative T2M retrieval top-3 results on HumanML3D. Correct retrievals (ground-truth match) are highlighted in green.
Figure 4: MaxSim interaction score maps for two text-motion pairs. Left: 3D motion. Middle: normalized Motion Image. Right: interaction score map (brighter = stronger alignment).
