UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling
Qiangong Zhou, Nagasaka Tomohiro
TL;DR
UniTAF tackles the challenge of aligning speech and facial expressions by reusing intermediate TTS representations to drive A2F in a modular framework. It combines a frozen TTS backbone (IndexTTS2), a lightweight Audio Feature Adapter, and an A2F decoder (UniTalker2) to enable joint text-to-speech and speech-to-face generation without modifying the TTS model. To mitigate audio–face misalignment, the authors propose a two-stage training strategy and a GT Audio Token replacement mechanism, along with a mouth-state-aware vertex loss that adaptively emphasizes extreme mouth shapes. The work emphasizes engineering feasibility and provides practical guidance for the co-design of TTS and A2F, rather than pursuing perceptual benchmarks. The code is publicly available, which supports its use in real-world system design and its extension to richer expression modeling.
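As a rough illustration of what a mouth-state-aware vertex loss could look like, the sketch below reweights a per-frame vertex error so that frames with extreme mouth shapes count more. It is not the authors' formulation: the mouth-state proxy (distance between one upper-lip and one lower-lip vertex), the weighting rule, the `alpha` parameter, and all function names are assumptions made for this example.

```python
from math import dist


def mouth_opening(frame, upper_idx, lower_idx):
    """Hypothetical mouth-state proxy: distance between two lip vertices."""
    return dist(frame[upper_idx], frame[lower_idx])


def mouth_state_weights(gt_frames, upper_idx, lower_idx, alpha=1.0):
    """Per-frame weights in [1, 1 + alpha].

    Frames whose ground-truth mouth opening is far from the clip mean
    (wide open or tightly closed) receive larger weights. This rule is
    an assumption, not the weighting used in the paper.
    """
    openings = [mouth_opening(f, upper_idx, lower_idx) for f in gt_frames]
    mean_open = sum(openings) / len(openings)
    deviations = [abs(o - mean_open) for o in openings]
    max_dev = max(deviations)
    if max_dev == 0.0:
        # All frames share the same mouth state: fall back to uniform weights.
        return [1.0] * len(openings)
    return [1.0 + alpha * d / max_dev for d in deviations]


def weighted_vertex_loss(pred_frames, gt_frames, upper_idx, lower_idx, alpha=1.0):
    """Weighted mean over frames of the mean squared vertex distance."""
    weights = mouth_state_weights(gt_frames, upper_idx, lower_idx, alpha)
    total = 0.0
    for w, pred, gt in zip(weights, pred_frames, gt_frames):
        frame_err = sum(dist(p, g) ** 2 for p, g in zip(pred, gt)) / len(gt)
        total += w * frame_err
    return total / sum(weights)
```

Each frame here is a list of 3D vertex tuples, and the weights are computed from the ground truth only, so a prediction cannot lower its own penalty by changing its mouth shape. In a real training loop the same idea would be written with tensors so that gradients flow through the predicted vertices.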
Abstract
This work explores merging two independent models, text-to-speech (TTS) and audio-to-face (A2F), into a unified model to enable internal feature transfer, thereby improving the consistency between the audio and the facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate TTS representations for joint modeling of speech and facial expressions, and it provides engineering practice references for subsequent speech-expression co-design. The project code is open source at: https://github.com/GoldenFishes/UniTAF
