UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling

Qiangong Zhou; Nagasaka Tomohiro

TL;DR

UniTAF addresses the alignment of speech and facial expressions by reusing intermediate TTS representations to drive A2F within a modular framework. It combines a frozen TTS backbone (IndexTTS2), a lightweight Audio Feature Adapter, and an A2F decoder (UniTalker) to generate speech and facial animation jointly from text without rewriting the TTS model. To mitigate audio–face misalignment, the authors propose a two-stage training strategy and a GT Audio Token replacement mechanism, along with a mouth-state-aware vertex loss that adaptively emphasizes extreme mouth shapes. The work focuses on engineering feasibility and offers practical guidance for the co-design of TTS and A2F rather than pursuing perceptual benchmarks. The code is open source, which supports use in real-world system design and extension to richer expression modeling.

Abstract

This work considers merging two independent models, TTS and A2F, into a unified model to enable internal feature transfer, thereby improving the consistency between audio and facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate TTS representations for joint modeling of speech and facial expressions, and provides an engineering reference for subsequent speech–expression co-design. The project code is open source at: https://github.com/GoldenFishes/UniTAF

Paper Structure (34 sections, 7 equations, 15 figures)

Introduction
Related Work
Text-to-Speech with Intermediate Representation
Audio-driven Facial Animation
Joint Modeling of Speech and Facial Motion
Position of Our Work
Training UniTAF Models
Overall Model Design
Design Choices for Joint TTS–A2F Modeling
Initial Design Objectives of UniTAF
Analysis of Audio–Face Misalignment
Sources of Misalignment
Resolving Misalignment by Injecting Ground-Truth Audio Tokens under a Frozen TTS
Two-Stage Training Strategy
Stage I: Pretraining the Audio Feature Adapter
...and 19 more sections

Figures (15)

Figure 1: Audio–expression misalignment issue during training on the UniTAF dataset.
Figure 2: The process of concatenating GT audio for the joint TTS+A2F model. The left black solid-line pipeline represents the second design choice, where TTS is fully frozen and a complete A2F module is trained, which encounters misalignment between TTS-generated audio and the GT audio in the dataset. The left red dashed-line pipeline represents concatenating GT audio during training to obtain GT audio features, thereby resolving the audio–face misalignment during training. The right pipeline shows training a complete A2F model independently using the audio–face pairs from the dataset.
Figure 3: Two-stage training strategy for IndexTTS2 + UniTalker Decoder. In the first stage, IndexTTS2 features are projected into the UniTalker feature space; in the second stage, the facial expression generation component is trained.
Figure 4: Overview of the UniTAF inference and training pipeline. In the first stage, the Projector layer is trained to align audio features to the UniTalker feature space; in the second stage, the Projector and the A2F Decoder are jointly trained to improve facial expression quality. Throughout training, the TTS outputs are consistently replaced with GT Audio Tokens to ensure data alignment, and the framework supports training the TTS component when the TTS inference distribution differs from the training distribution, in order to compensate for alignment issues.
Figure 5: (a)
...and 10 more figures
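
The two-stage schedule and the GT Audio Token replacement described in the Figure 3 and Figure 4 captions can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the module names (`tts`, `projector`, `a2f_decoder`) and both helper functions are hypothetical, and the TTS is assumed to stay frozen in both stages.

```python
from dataclasses import dataclass

# Hypothetical module names. The source only describes a frozen TTS,
# a Projector (Audio Feature Adapter), and an A2F Decoder.
MODULES = ("tts", "projector", "a2f_decoder")


@dataclass(frozen=True)
class StagePlan:
    stage: int
    trainable: frozenset  # names of modules that receive gradient updates


def plan_for_stage(stage: int) -> StagePlan:
    """Return which modules are trained in each stage (TTS assumed frozen)."""
    if stage == 1:
        # Stage I: only the Projector is trained, to map TTS features
        # into the A2F feature space.
        trainable = {"projector"}
    elif stage == 2:
        # Stage II: the Projector and the A2F Decoder are trained jointly.
        trainable = {"projector", "a2f_decoder"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    return StagePlan(stage=stage, trainable=frozenset(trainable))


def select_audio_tokens(tts_tokens, gt_tokens, training: bool):
    """Pick the audio tokens that feed the adapter.

    During training, ground-truth (GT) audio tokens replace the TTS outputs
    so that the audio features stay aligned with the facial data in the
    dataset. At inference, the TTS-generated tokens are used.
    """
    if training:
        if gt_tokens is None:
            raise ValueError("GT audio tokens are required during training")
        return gt_tokens
    return tts_tokens


if __name__ == "__main__":
    for stage in (1, 2):
        plan = plan_for_stage(stage)
        frozen = sorted(set(MODULES) - plan.trainable)
        print(f"stage {stage}: train={sorted(plan.trainable)} frozen={frozen}")
```

In a real training loop, a plan like this would typically be applied by toggling gradient computation on each module before building the optimizer. The Figure 4 caption also mentions optionally training the TTS component, which this sketch leaves out.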
