Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang; Zhipeng Li; Yiwen Guo; Tianshu Yu

TL;DR

Ex-Omni tackles the challenge of unifying speech with 3D facial animation in omni-modal LLMs by decoupling semantic reasoning from temporal synthesis. It introduces discrete speech units as temporal scaffolding and the token-as-query gated fusion (TQGF) mechanism to tightly couple language-based semantics with motion generation. The InstructEx dataset and a staged training protocol enable joint learning of language understanding, speech generation, and 3D facial motion, achieving competitive results against open-source OLLMs. This framework advances natural, speech-guided avatars and embodied agents, with potential for more expressive and temporally stable multimodal interactions.

Abstract

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset designed to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.

Paper Structure (54 sections, 15 equations, 6 figures, 10 tables)

Introduction
Related Work
Omni-modal Large Language Models.
Facial Animation Generation.
Method
Overview
Unified Speech-Text Representation
LLM-Centered Reasoning
Joint Speech and 3D Facial Animation Generation
Training Strategy
Stage I (Speech-Text Alignment).
Stage II (Speech Generation Pre-training).
Stage III (Speech-Face Co-training).
Stage IV (Joint Fine-tuning).
Training Objectives
...and 39 more sections

Figures (6)

Figure 1: Overview of the motivation behind Ex-Omni. It supports any combination of textual and speech inputs, and is capable of unified generation of multimodal outputs, including text, speech, and 3D facial animation.
Figure 2: Model architecture of Ex-Omni.
Figure 3: Case study on 3D facial animation generation. The figure highlights mouth-opening behaviors aligned with phonemes that require large lip movements. (a) Results generated from English speech; (b) Results generated from Chinese speech. "[...]" indicates omitted content for brevity, and parenthetical annotations denote dominant articulation cues.
Figure 4: Loss curves at different training stages for LLMs of different parameter sizes.
Figure 5: Response audio duration distribution.
...and 1 more figure
