A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model

Xiaolin Hu; Hang Yuan; Xinzhu Sang; Binbin Yan; Zhou Yu; Cong Huang; Kai Chen

TL;DR

A$^2$-LLM is an end-to-end conversational audio avatar LLM that jointly reasons about language, prosody, and 3D facial motion, addressing the latency and the Semantic-Emotion Gap inherent in cascaded systems. It uses RVQ-VAE-based residual motion tokenization and a Motion Connector to ground expressive facial dynamics in semantic context, and it is trained on FLAME-QA, a large multimodal QA dataset designed to enforce context-conditioned facial behavior. A three-stage LoRA-based curriculum (Motion Connector pretraining, LoRA reset joint alignment, and affective instruction tuning) keeps joint training stable and expressive. The model runs in real time (about 500 ms latency, 0.7 RTF) and is more expressive than audio-centric baselines. The work shows that fully integrated language–audio–visual modeling yields emotionally coherent avatars suitable for immersive HCI and VR/XR applications, and it identifies multilingual support and full-body gestures as future directions.

Abstract

Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial dynamics within a QA format. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements beyond simple lip-synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency, 0.7 RTF).
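To make the efficiency figure concrete, the short sketch below computes a real-time factor (RTF) under the conventional definition, processing time divided by the duration of the generated media. Whether the paper measures RTF in exactly this way is an assumption, and the function name and the 7 s / 10 s timings are hypothetical values chosen only to reproduce 0.7.

```python
def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    """Conventional RTF: time spent generating divided by the length of the output.

    Values below 1.0 mean generation runs faster than playback.
    """
    if media_seconds <= 0:
        raise ValueError("media_seconds must be positive")
    return processing_seconds / media_seconds


# Hypothetical timings: 10 s of speech and motion generated in 7 s of compute.
rtf = real_time_factor(7.0, 10.0)
is_faster_than_real_time = rtf < 1.0
```

Under this definition an RTF of 0.7 leaves headroom for streaming: each second of output takes roughly 0.7 s to produce, so playback does not stall once the first chunk has arrived.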

Paper Structure (23 sections, 17 equations, 4 figures, 6 tables)

Introduction
Related Works
Preliminaries
End-to-End Audio-Language Models
FLAME 3D Facial Representation
Problem Formulation
FLAME-QA Dataset
Method
Residual Motion Tokenization
Motion Connector with Hierarchical Context
Multimodal Large Language Model and Training Strategy
Training Objective
Experiments
Real-time Behavior
Language Capabilities
...and 8 more sections

Figures (4)

Figure 1: Schematic of A2-LLM. Left: The unified framework where the Shared LLM Layer and Motion Connector are jointly optimized to generate synchronized speech and facial dynamics. Right: The Motion Connector predicts hierarchical tokens via Motion Heads. It adopts a segment-wise autoregressive design, utilizing audio hidden states (serving as Queries) to attend to the buffered motion history (Keys and Values) for temporal continuity.
Figure 2: Illustration of the Motion Tokenization Module. The module encodes sequences of continuous FLAME parameters into compact hierarchical discrete tokens via Residual Vector Quantization (RVQ), enabling efficient autoregressive modeling.
Figure 3: Visual comparison of facial expressiveness. A2-LLM can spontaneously generate facial motions that better match the content of the reply without requiring explicit emotion conditioning.
Figure 4: Prompt used by GPT-5.1 to clean ASR transcripts, assess their suitability as QA answers, and generate corresponding questions in JSON format for constructing the FLAME-QA dataset.
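The Residual Vector Quantization step named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It shows only the generic RVQ idea: each level quantizes the residual left by the previous level, so one vector becomes a short stack of coarse-to-fine token indices. The two tiny hand-written codebooks, the 2-D input, and the function names are assumptions made for illustration. The paper's learned encoder, decoder, and codebook sizes are not reproduced here.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Return one codeword index per level by repeatedly quantizing the residual."""
    residual = np.asarray(x, dtype=np.float64).copy()
    indices = []
    for codebook in codebooks:  # each codebook has shape (num_codes, dim)
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        indices.append(index)
        residual = residual - codebook[index]  # pass what is left to the next level
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))


# Hypothetical two-level codebooks: level 0 is coarse, level 1 refines the residual.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]),
    np.array([[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]]),
]
frame = np.array([1.4, 0.6])  # stand-in for one frame of latent motion features

tokens = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
```

Here the first level picks the codeword `[1.0, 1.0]`, and the second level picks `[0.5, -0.5]` to absorb most of the remaining error, giving tokens `[1, 1]` and a reconstruction of `[1.5, 0.5]`. This hierarchy of indices is the kind of compact discrete representation an autoregressive model can predict level by level.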
