M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production

Alexandre Symeonidis-Herzig, Jianhe Low, Ozge Mercanoglu Sincan, Richard Bowden

Abstract

Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME's rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T) M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.

Paper Structure (22 sections, 12 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Sign Motion Language Model. A single autoregressive transformer handles both sign language production (SLP, text$\rightarrow$motion) and sign language translation (SLT, motion$\rightarrow$text) within a shared token space. Spoken-language tokens and modality-specific FSQ motion tokens (body, hands, face) are embedded and concatenated with boundary identifiers (e.g., <ASL_RH>, <DGS_F>) that distinguish language and articulator streams. The model predicts the next token across all modalities using a unified encoder and decoder with modality-specific output heads. An auxiliary translation objective encourages semantically grounded token embeddings, improving production quality. Both production and translation share the same encoder-decoder parameters.
  • Figure 2: SMPL-FX (ours) vs. SMPL-X. Facial parameters for SMPL-FX are extracted via Pixel3DMM [giebenhain2025pixel3dmm]; for SMPL-X we use the parameters extracted by SOKE [zuo2025soke]. SMPL-X captures primarily head rotation and fails to capture mouth shape or subtler expressions.
  • Figure 3: Multi-modal FSQ-VAE Tokenization. Our FSQ-VAE framework discretizes modality-specific SMPL-FX latents into a structured grid. The resulting token streams provide a unified, compact, and discrete multi-modal representation specifically optimized for signing motion.
  • Figure 4: Qualitative comparisons with SOKE [zuo2025soke] on How2Sign (left), CSL-Daily (middle), and Phoenix14T (right). SOKE's facial outputs are largely static: jaw position and facial expression change minimally across frames and do not track the utterance, a direct consequence of SMPL-X's ten-dimensional face space. M3T generates non-manual features (NMFs) that vary with the signing content.
  • Figure 5: Face token frequency distributions. VQ produces a heavily skewed distribution (most face codes are unused and a handful dominate), while FSQ yields more uniform coverage. Rarely used tokens receive insufficient gradient signal, undermining the auxiliary translation objective's ability to ground face embeddings semantically; FSQ's structured grid eliminates this bottleneck.
  • ...and 3 more figures
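
The grid quantization described for Figure 3 and the code-usage comparison in Figure 5 can be illustrated with a minimal sketch of Finite Scalar Quantization. This follows the general FSQ idea (bound each latent dimension, round it to a small number of levels, and read the result as one token index); it is not the authors' implementation. The function names, the level counts [5, 3, 3], and the restriction to odd level counts are assumptions made for illustration only.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Quantize latents z of shape (..., d) onto a fixed grid.

    levels: number of quantization levels per latent dimension
    (hypothetical values; odd counts only, to keep the sketch simple).
    Returns integer codes in [0, L_i - 1] for each dimension i.
    """
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch handles odd level counts only"
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half      # squash each dimension into [-half, half]
    rounded = np.round(bounded)      # snap to the nearest integer grid point
    return (rounded + half).astype(np.int64)


def codes_to_index(codes, levels):
    """Flatten per-dimension codes into a single token index (mixed radix)."""
    levels = np.asarray(levels)
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (codes * basis).sum(axis=-1)


def codebook_usage(indices, codebook_size):
    """Fraction of the implicit codebook that appears at least once."""
    return np.unique(indices).size / codebook_size


if __name__ == "__main__":
    levels = [5, 3, 3]                        # implicit codebook of 5 * 3 * 3 = 45 codes
    rng = np.random.default_rng(0)
    z = rng.normal(size=(10_000, 3)) * 2.0    # stand-in for encoder outputs
    tokens = codes_to_index(fsq_quantize(z, levels), levels)
    print("usage:", codebook_usage(tokens, int(np.prod(levels))))
```

Because the codebook is an implicit grid rather than a set of learned vectors, every index is reachable by construction. How evenly the indices are actually used still depends on the encoder's output distribution.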