A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model
Xiaolin Hu, Hang Yuan, Xinzhu Sang, Binbin Yan, Zhou Yu, Cong Huang, Kai Chen
TL;DR
A$^2$-LLM is an end-to-end conversational audio avatar LLM that jointly reasons about language, prosody, and 3D facial motion, addressing the latency and the Semantic-Emotion Gap inherent in cascaded systems. It uses RVQ-VAE-based residual motion tokenization and a Motion Connector to ground expressive facial dynamics in semantic context, and it is trained on FLAME-QA, a large multimodal QA dataset designed to enforce context-conditioned facial behavior. A three-stage LoRA-based curriculum (Motion Connector pretraining, joint alignment with LoRA reset, and affective instruction tuning) enables stable, expressive joint training. The system runs in real time (about 500 ms latency, 0.7 RTF) and is more expressive than audio-centric baselines. The work demonstrates that fully integrated language–audio–visual modeling yields emotionally coherent avatars suitable for immersive HCI and VR/XR applications, and it identifies multilingual support and full-body gestures as future directions.
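To make the idea of residual motion tokenization concrete, the following is a minimal sketch of generic residual vector quantization in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the two tiny hand-written codebooks, and the 2-D input vector are all hypothetical, and a real RVQ-VAE would learn its codebooks and operate on encoder outputs for facial motion.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage encodes what earlier stages missed."""
    residual = list(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse stage followed by a finer correction stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]

tokens, leftover = rvq_encode([1.2, 0.8], CODEBOOKS)
reconstruction = rvq_decode(tokens, CODEBOOKS)
```

Each stage emits one discrete token, so a motion frame becomes a short sequence of indices that a language model can predict, with later stages refining the coarse approximation from earlier ones.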
Abstract
Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines often suffer from accumulated errors and high latency, which degrade real-time performance. Lacking access to the underlying conversational context, they also prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset in QA format designed to align semantic intent with expressive facial dynamics. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements that go beyond simple lip synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency and a real-time factor (RTF) of 0.7).
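For readers unfamiliar with the efficiency metric, the short sketch below shows how a real-time factor is commonly computed, assuming the standard definition (processing time divided by the duration of the generated media). The function name and the example timings are hypothetical and are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    """Ratio of compute time to output duration; values below 1.0 are faster than real time."""
    if media_seconds <= 0:
        raise ValueError("media_seconds must be positive")
    return processing_seconds / media_seconds


# Hypothetical timings: 7 s of compute to produce 10 s of speech and motion.
rtf = real_time_factor(7.0, 10.0)
is_faster_than_real_time = rtf < 1.0
```

Under this definition, an RTF of 0.7 means each second of output takes about 0.7 seconds to generate, so generation can stay ahead of playback once the initial response latency has elapsed.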
