TDMM-LM: Bridging Facial Understanding and Animation via Language Models

Luchuan Song, Pinxin Liu, Haiyang Liu, Zhenchao Jin, Yolo Yunlong Tang, Zichong Xu, Susan Liang, Jing Bi, Jason J Corso, Chenliang Xu

Abstract

Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt, parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that, in this setting, language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.

Paper Structure (10 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Overview of the proposed Open3DFaceVid dataset and 3D facial understanding/animation pipeline. The left panel visualizes the Open3DFaceVid corpus, which covers a wide range of identities, emotions, and speaking styles generated via text-to-video (T2V) models. The right panel illustrates our interactive 3D facial interface: given a 3DMM sequence, the user prompts the agent to describe expressions and head motion in natural language, and the agent returns fine-grained, parameter-based interpretations. In the reverse direction, the agent is able to condition on user prompts to generate new 3DMM trajectories with controllable emotion and pose. Please refer to https://songluchuan.github.io/TDMM-LM/ for visualization results and datasets.
  • Figure 2: The analysis of the Open3DFaceVid dataset. We summarize the control categories induced by prompts and their corresponding video counts, broken down by underlying T2V backbones. We further visualize the vocabulary with word clouds, separately for emotion-related terms and for full-text prompts, to highlight the diversity and saliency of affective descriptors.
  • Figure 3: Dataset overview. Top two rows: starting from a fixed text prompt, we vary the random seed and emphasize different prompt keywords to modulate facial identity and video attributes, showcasing subjects across different genders. Bottom three rows: we recover FLAME facial parameters and pair the resulting trajectories with the corresponding prompt, forming the Text–3DMM dataset.
  • Figure 4: Geometry-aware facial tokenization learning. We quantize facial expression codes into a discrete codebook and enforce reconstruction in mesh space. The input facial expression codes are mapped to code indices, decoded back to FLAME meshes (bottom), and supervised with an $\mathcal{L}_1$ loss on vertex positions.
  • Figure 5: Motion2Language. Geometry sequences are encoded into discrete facial tokens by the geometry encoder and fed, together with text tokens from the user prompt, into an LLM. Conditioned only on these geometry tokens, the agent generates natural-language descriptions of expression/head motion, enabling interactive question answering about 3D facial behavior.
  • ...and 8 more figures
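
The tokenization idea in the Figure 4 caption (map each expression code to a codebook index, then supervise reconstruction with an $\mathcal{L}_1$ loss on vertex positions) can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-neighbor lookup under squared Euclidean distance, the toy codebook, the toy vertex lists, and the function names `nearest_code`, `quantize`, and `l1_vertex_loss` are all assumptions made for illustration, and the FLAME decoder is omitted entirely.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(z: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to z.

    Assumption: closeness is squared Euclidean distance, as in a
    standard vector-quantization lookup.
    """
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(z, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize(sequence: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map a per-frame sequence of expression codes to discrete token indices."""
    return [nearest_code(z, codebook) for z in sequence]


def l1_vertex_loss(pred: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean absolute error over all vertex coordinates of two meshes."""
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        for a, b in zip(p, t):
            total += abs(a - b)
            count += 1
    return total / count


# Toy data (hypothetical): three 2-D codebook entries and three frames.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.4)]
tokens = quantize(frames, codebook)

# Toy meshes (hypothetical): two 3-D vertices each, standing in for
# decoded and ground-truth FLAME vertex positions.
pred_vertices = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
target_vertices = [(0.0, 0.5, 0.0), (1.0, 1.0, 2.0)]
loss = l1_vertex_loss(pred_vertices, target_vertices)

print(tokens)  # [0, 1, 2]
print(loss)    # 0.25
```

In a trained tokenizer the codebook would be learned and the predicted vertices would come from decoding the selected codes through the face model; the sketch only shows the lookup and the loss in isolation.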