Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

TL;DR

Ex-Omni tackles the challenge of unifying speech with 3D facial animation in omni-modal LLMs by decoupling semantic reasoning from temporal synthesis. It introduces discrete speech units as temporal scaffolding and the token-as-query gated fusion (TQGF) mechanism to tightly couple language-based semantics with motion generation. The InstructEx dataset and a staged training protocol enable joint learning of language understanding, speech generation, and 3D facial motion, achieving competitive results against open-source OLLMs. This framework advances natural, speech-guided avatars and embodied agents, with potential for more expressive and temporally stable multimodal interactions.

Abstract

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset designed to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.

Paper Structure (54 sections, 15 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Overview of the motivation behind Ex-Omni. It supports any combination of textual and speech inputs, and is capable of unified generation of multimodal outputs, including text, speech, and 3D facial animation.
  • Figure 2: Model architecture of Ex-Omni.
  • Figure 3: Case study on 3D facial animation generation. The figure highlights mouth-opening behaviors aligned with phonemes that require large lip movements. (a) Results generated from English speech; (b) Results generated from Chinese speech. "[...]" indicates omitted content for brevity, and parenthetical annotations denote dominant articulation cues.
  • Figure 4: Loss curves across training stages for LLMs of different parameter sizes.
  • Figure 5: Response audio duration distribution.
  • ...and 1 more figure