
UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling

Qiangong Zhou, Nagasaka Tomohiro

TL;DR

UniTAF addresses the alignment of speech and facial expressions by reusing intermediate TTS representations to drive A2F within a modular framework. It combines a frozen TTS backbone (IndexTTS2), a lightweight Audio Feature Adapter, and an A2F decoder (UniTalker2), enabling joint text-to-speech and speech-to-face generation without rewriting the TTS model. To mitigate audio–face misalignment, the authors propose a two-stage training strategy and a GT Audio Token replacement mechanism, together with a mouth-state-aware vertex loss that adaptively emphasizes extreme mouth shapes. The work prioritizes engineering feasibility and practical guidance for the co-design of TTS and A2F over perceptual benchmarks. The code is publicly available, which supports reuse in real-world system design and extension to richer expression modeling.
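The idea of a mouth-state-aware vertex loss can be illustrated with a minimal sketch. The summary does not give the paper's formula, so everything below is an assumption made for illustration: the mouth state is approximated by the distance between two hypothetical lip vertices in the ground-truth frame, and frames whose opening lies near the sequence extremes (fully closed or wide open) receive a weight of up to `1 + alpha`. The names `mouth_opening`, `frame_weights`, `mouth_aware_vertex_loss`, `upper_lip`, `lower_lip`, and `alpha` are not taken from the UniTAF code.

```python
from typing import List, Sequence

# One frame is a list of (x, y, z) vertex coordinates.
Frame = Sequence[Sequence[float]]


def mouth_opening(frame: Frame, upper_lip: int, lower_lip: int) -> float:
    """Euclidean distance between two (hypothetical) lip vertex indices."""
    u, l = frame[upper_lip], frame[lower_lip]
    return sum((a - b) ** 2 for a, b in zip(u, l)) ** 0.5


def frame_weights(openings: List[float], alpha: float = 1.0) -> List[float]:
    """Assumed weighting: 1 for a mid-range mouth, 1 + alpha at the extremes."""
    lo, hi = min(openings), max(openings)
    if hi == lo:
        # No variation in mouth state: fall back to uniform weights.
        return [1.0] * len(openings)
    mid, half = (hi + lo) / 2, (hi - lo) / 2
    return [1.0 + alpha * abs(o - mid) / half for o in openings]


def mouth_aware_vertex_loss(
    pred: Sequence[Frame],
    target: Sequence[Frame],
    upper_lip: int,
    lower_lip: int,
    alpha: float = 1.0,
) -> float:
    """Weighted mean squared vertex error; weights come from the target only."""
    openings = [mouth_opening(f, upper_lip, lower_lip) for f in target]
    weights = frame_weights(openings, alpha)
    total = 0.0
    for w, p_frame, t_frame in zip(weights, pred, target):
        sq = sum(
            (a - b) ** 2
            for pv, tv in zip(p_frame, t_frame)
            for a, b in zip(pv, tv)
        )
        total += w * sq / len(t_frame)  # per-frame mean over vertices
    return total / sum(weights)
```

Under this weighting, the same vertex error costs more on a frame with a fully closed or wide-open mouth than on a frame with a mid-range opening, which is the behavior the TL;DR attributes to the loss.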

Abstract

This work considers merging two independent models, TTS and A2F, into a unified model that enables internal feature transfer, thereby improving the consistency between the audio and the facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate TTS representations for joint modeling of speech and facial expressions, and provides an engineering reference for subsequent speech–expression co-design. The project code is open-sourced at: https://github.com/GoldenFishes/UniTAF

Paper Structure (34 sections, 7 equations, 15 figures)

Figures (15)

  • Figure 1: Audio–expression misalignment issue during training on the UniTAF dataset.
  • Figure 2: The process of concatenating GT audio for the joint TTS+A2F model. The left black solid-line pipeline represents the second design choice, where TTS is fully frozen and a complete A2F module is trained, which encounters misalignment between TTS-generated audio and the GT audio in the dataset. The left red dashed-line pipeline represents concatenating GT audio during training to obtain GT audio features, thereby resolving the audio–face misalignment during training. The right pipeline shows training a complete A2F model independently using the audio–face pairs from the dataset.
  • Figure 3: Two-stage training strategy for IndexTTS2 + UniTalker Decoder. In the first stage, IndexTTS features are projected into the UniTalker feature space; in the second stage, the facial expression generation component is formally trained.
  • Figure 4: Overview of the UniTAF inference and training pipeline. In the first stage, the Projector layer is trained to align audio features to the UniTalker feature space; in the second stage, the Projector and the A2F Decoder are jointly trained to improve facial expression quality. Throughout training, the TTS outputs are consistently replaced with GT Audio Tokens to ensure data alignment, and the framework supports training the TTS component when the TTS inference distribution differs from the training distribution, in order to compensate for alignment issues.
  • Figure 5: (a)
  • ...and 10 more figures
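
The training schedule described in the captions of Figures 3 and 4 can be sketched as a small configuration helper. This is a hedged illustration, not the repository's API: the module names `tts_backbone`, `projector`, and `a2f_decoder`, and the functions `trainable_modules` and `select_audio_tokens`, are hypothetical. The sketch assumes that stage 1 trains only the Projector, stage 2 trains the Projector and the A2F Decoder jointly, the TTS backbone stays frozen unless explicitly enabled, and GT Audio Tokens replace the TTS outputs whenever the model is training.

```python
from typing import Dict, List, Optional


def trainable_modules(stage: int, train_tts: bool = False) -> Dict[str, bool]:
    """Return which (hypothetically named) modules receive gradients."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    return {
        # Frozen by default; optionally trained to compensate for a gap
        # between the TTS training and inference distributions.
        "tts_backbone": train_tts,
        # Trained in both stages.
        "projector": True,
        # Trained only in the second stage.
        "a2f_decoder": stage == 2,
    }


def select_audio_tokens(
    predicted: List[int],
    gt: Optional[List[int]],
    training: bool,
) -> List[int]:
    """Use GT audio tokens during training, TTS predictions at inference."""
    if training:
        if gt is None:
            raise ValueError("GT audio tokens are required during training")
        return gt
    return predicted
```

For example, `trainable_modules(1)` marks only the Projector as trainable, and `select_audio_tokens(pred, gt, training=True)` returns `gt`, so the face targets in the dataset stay paired with the audio they were recorded with.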