Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

Qing Yu, Mikihiro Tanaka, Kent Fujiwara

TL;DR

This work addresses the scarcity of large-scale motion-language data by introducing motion patches, a unified representation that converts 3D motion sequences into image-like inputs a ViT can consume. A ViT motion encoder initialized with ImageNet-pretrained weights is paired with a DistilBERT text encoder and trained with a CLIP-style symmetric loss to build a cross-modal latent space for 3D motion and language. The method achieves state-of-the-art text-to-motion and motion-to-text retrieval on HumanML3D and KIT-ML, and demonstrates cross-skeleton transfer, zero-shot motion classification, and human interaction recognition. The results suggest that image-domain priors are useful for motion understanding, and that the approach can ease data requirements and handle varying skeleton structures, with potential applications in animation, human-robot interaction, and multimodal motion analysis.

Abstract

To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
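
The abstract describes motion patches only at a high level: joints are grouped and sorted by body part, and the result is treated like color image patches. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The 22-joint grouping in `BODY_PARTS`, the linear interpolation of each body part to a fixed number of points, the z-score normalization, and the patch size of 16 are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

# Hypothetical joint grouping for a 22-joint skeleton (indices are
# illustrative assumptions, ordered along each kinematic chain).
BODY_PARTS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}


def resample_joints(part, num_points):
    """Linearly interpolate a (T, K, 3) joint chain to (T, num_points, 3)."""
    num_frames, num_joints, num_channels = part.shape
    src = np.linspace(0.0, 1.0, num_joints)
    dst = np.linspace(0.0, 1.0, num_points)
    out = np.empty((num_frames, num_points, num_channels), dtype=np.float64)
    for t in range(num_frames):
        for c in range(num_channels):
            out[t, :, c] = np.interp(dst, src, part[t, :, c])
    return out


def motion_to_image(motion, patch_size=16):
    """Turn a (T, J, 3) motion into an image-like (parts * patch_size, T, 3) array."""
    rows = [resample_joints(motion[:, idx, :], patch_size)
            for idx in BODY_PARTS.values()]
    stacked = np.concatenate(rows, axis=1)  # (T, parts * patch_size, 3)
    # Assumed normalization: z-score per coordinate channel (x, y, z).
    mean = stacked.mean(axis=(0, 1), keepdims=True)
    std = stacked.std(axis=(0, 1), keepdims=True) + 1e-8
    normed = (stacked - mean) / std
    # Rows index body-part points, columns index time, channels act like RGB.
    return normed.transpose(1, 0, 2)


def split_patches(image, patch_size=16):
    """Cut an (H, W, 3) image into non-overlapping square patches, ViT-style."""
    height, width, channels = image.shape
    width = (width // patch_size) * patch_size  # drop trailing frames
    image = image[:, :width]
    grid = image.reshape(height // patch_size, patch_size,
                         width // patch_size, patch_size, channels)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size, patch_size, channels)
```

Under these assumptions, a 64-frame motion with 22 joints becomes an 80 x 64 x 3 array and then 20 patches of size 16 x 16 x 3, each covering one body part over 16 consecutive frames. Because every body part is resampled to the same number of points, skeletons with different joint counts map to the same input layout, which is one plausible reading of why the representation is described as robust to varying skeleton structures.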

Paper Structure (30 sections, 2 equations, 9 figures, 11 tables)

Figures (9)

  • Figure 1: Overview of the existing methods and the proposed method. The existing methods train an original Transformer with the joint information from the motion sequences directly, while the proposed method converts them into motion patches and then trains the ViT, which can be initialized with pre-trained weights.
  • Figure 2: Overview of the proposed framework, which consists of a motion encoder and a text encoder. We transform the raw motion sequences into motion patches as the input of the ViT-based motion encoder. We calculate the similarity matrix between text-motion pairs within a batch to train the model. For illustration, we show an example batch containing three samples.
  • Figure 3: Process of building the motion patches for each motion sequence. Given a skeleton, we mark different body parts in different colors. We show the method to construct the motion patch of the right leg. The same process is applied to other body parts.
  • Figure 4: Visualization of motion patches by regarding the joint coordinates as RGB pixels. We show the rendered motions and their text labels on the left and the processed motion patches on the right. We can observe different motions reflected in different motion patches.
  • Figure 5: Qualitative results of text-to-motion retrieval. For each query, we show the retrieved motions ranked by text-motion similarity and their accompanying ground-truth text labels. Note that these descriptions are not used in the retrieval process. All motions in the gallery are from the test set and were unseen during training. For the first two examples, the text queries are sampled from the data. For the last example, we query with a free-form text.
  • ...and 4 more figures
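
The batch-wise similarity matrix in Figure 2 and the CLIP-style symmetric loss mentioned in the TL;DR can be sketched as follows. This is a generic symmetric contrastive loss written in NumPy, not code from the paper. The function names, the use of cosine similarity, and the fixed temperature of 0.07 are assumptions; the paper's exact formulation and temperature handling may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """Average of motion-to-text and text-to-motion cross-entropy.

    motion_emb, text_emb: (B, D) arrays where row i of each forms a matched pair.
    temperature: assumed fixed value for this sketch.
    """
    m = l2_normalize(motion_emb)
    t = l2_normalize(text_emb)
    logits = m @ t.T / temperature          # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])       # matched pairs sit on the diagonal
    loss_m2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2m = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_m2t + loss_t2m)


# Example batch of three pairs, mirroring the three-sample batch in Figure 2.
rng = np.random.default_rng(0)
motion_emb = rng.normal(size=(3, 8))
text_emb = motion_emb + 0.01 * rng.normal(size=(3, 8))
loss = symmetric_contrastive_loss(motion_emb, text_emb)
```

Each row of the similarity matrix scores one motion against every text in the batch, and each column scores one text against every motion. Minimizing both directions pulls matched pairs together and pushes mismatched pairs apart, which is what allows the same embedding space to serve both text-to-motion and motion-to-text retrieval.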