Language-Guided Transformer Tokenizer for Human Motion Generation

Sheng Yan; Yong Wang; Xin Du; Junsong Yuan; Mengyuan Liu

TL;DR

We propose a Transformer-based tokenizer that uses attention to align language and motion, together with a language-drop scheme that randomly removes language conditions during training so that the detokenizer supports language-free guidance during generation.

Abstract

In this paper, we focus on discrete motion tokenization, which converts raw motion into compact discrete tokens, a process that has proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common way to improve motion reconstruction quality, but more tokens make it harder for generative models to learn. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This approach not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based Tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme, in which language conditions are randomly removed during training, enabling the detokenizer to support language-free guidance during generation. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, respectively, outperforming state-of-the-art methods (MARDM: 0.500 and 0.528), and FID scores of 0.057 and 0.088, compared with 0.114 and 0.147. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588, FID: 0.085/0.071), validating the efficiency of our semantic representations.
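
As an illustration of the language-drop scheme described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that "removing" a language condition means swapping a sample's text embedding for a fixed null embedding, and that the decision is made independently per sample with probability `drop_prob`. The function name `apply_language_drop`, the `null_embedding` placeholder, and the default value 0.1 are all hypothetical; the source does not state the drop rate or the form of the dropped condition.

```python
import random


def apply_language_drop(text_embeddings, null_embedding, drop_prob=0.1, rng=None):
    """Randomly replace text conditions with a null embedding (hypothetical sketch).

    text_embeddings: list of per-sample embeddings (any Python objects).
    null_embedding:  placeholder used when the language condition is dropped
                     (assumption: the source does not specify its form).
    drop_prob:       probability of dropping each sample's condition
                     (assumption: the value 0.1 is illustrative only).
    Returns the conditioned batch and a list of booleans marking dropped samples.
    """
    rng = rng or random.Random()
    conditioned = []
    dropped = []
    for emb in text_embeddings:
        # rng.random() is uniform in [0.0, 1.0), so drop_prob=0.0 never drops
        # and drop_prob=1.0 always drops.
        drop = rng.random() < drop_prob
        dropped.append(drop)
        conditioned.append(null_embedding if drop else emb)
    return conditioned, dropped


# Example: a batch of three toy "embeddings" with a zero vector as the null condition.
batch = [[0.2, 0.5], [0.9, 0.1], [0.4, 0.4]]
null = [0.0, 0.0]
conditioned, dropped = apply_language_drop(batch, null, drop_prob=0.5, rng=random.Random(0))
```

Training the detokenizer on a mix of conditioned and unconditioned samples in this way is what would let it produce both a language-conditioned and a language-free reconstruction at generation time.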

Paper Structure (25 sections, 7 equations, 11 figures, 10 tables)

Introduction
Related Work
Method
Preliminary
Transformer-based Tokenizer
Language-Guided Tokenization
Language-Drop Scheme
Experiment
Experimental Setup
Text-driven Motion Generation Comparison
Comparison of Discrete Tokenizers
Ablation Studies
Tokenizer Analysis
Conclusion
Overview
...and 10 more sections

Figures (11)

Figure 1: Comparison between previous CNN-based tokenizers and our Language-Guided Transformer Tokenizer (LG-Tok). Our method aligns language and motion during tokenization, leveraging the transformer's flexibility.
Figure 2: Generation quality on HumanML3D.
Figure 3: Illustration of our LG-Tok framework. Given an input motion sequence and its natural language description, a frozen text encoder (e.g., LLaMA (Dubey et al., 2024)) extracts text embeddings, which are concatenated with learnable latent tokens and the motion sequence and fed into a Transformer-based tokenizer to produce high-level semantic motion tokens. The quantizer then quantizes these tokens into discrete codes for training downstream generative models. During detokenization, the dequantized embeddings, learnable mask tokens, and corresponding text embeddings interact via cross-attention layers within a Transformer-based detokenizer to reconstruct the motion sequence. For generation, motion tokens sampled by the trained generative model are dequantized and fed into the detokenizer to synthesize diverse, high-fidelity human motion.
Figure 4: Qualitative comparisons on the HumanML3D dataset. Our LG-Tok demonstrates superior semantic understanding compared to existing methods. The examples show better spatial awareness ("in the middle"), more realistic posture synthesis ("dodges quickly"), and improved directional control ("then turns to the left").
Figure 5: Evaluation sweep over guidance scale $g$. We evaluate the impact of different guidance scales on generation quality, showing optimal performance at $g=2.0$ for HumanML3D and $g=1.0$ for Motion-X.
...and 6 more figures
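
The guidance scale $g$ swept in Figure 5 can be illustrated with a small sketch. The formula used by the paper is not given in this excerpt, so the code below assumes the common classifier-free-guidance form, in which the output is the language-free prediction plus $g$ times the difference between the language-conditioned and language-free predictions. The function name `guided_output` and the toy values are hypothetical.

```python
def guided_output(cond, uncond, g):
    """Combine conditioned and language-free predictions (assumed CFG-style form).

    cond:   prediction obtained with the text condition (list of floats).
    uncond: prediction obtained with the text condition dropped (list of floats).
    g:      guidance scale; under this assumed form, g = 0 returns the
            language-free prediction and g = 1 returns the conditioned one.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + g * (c - u) for c, u in zip(cond, uncond)]


# Toy values standing in for per-feature detokenizer outputs.
cond = [1.0, 2.0]
uncond = [0.0, 0.0]
at_one = guided_output(cond, uncond, 1.0)   # equals the conditioned prediction
at_two = guided_output(cond, uncond, 2.0)   # extrapolates past the conditioned prediction
```

Under this assumed form, values of $g$ above 1 push the output further in the direction indicated by the text condition. Whether the reported optima ($g=2.0$ for HumanML3D and $g=1.0$ for Motion-X) follow this exact parameterization cannot be confirmed from the text here.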
