Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans
Hongbin Huang, Junwei Li, Tianxin Xie, Zhuang Li, Cekai Weng, Yaodong Yang, Yue Luo, Li Liu, Jing Tang, Zhijing Shao, Zeyu Wang
TL;DR
The paper addresses the challenge of building high-fidelity digital humans that are both visually realistic and responsive in real time. It proposes Hi-Reco, a modular system integrating a photorealistic 3D avatar, persona-driven speech synthesis, and knowledge-grounded dialogue via a retrieval-augmented generation framework, all coordinated by an asynchronous pipeline to minimize latency. Key contributions include the 3D Avatar Module leveraging DEGAS and Imitator for real-time emotion-matched facial animation; the Speech Module with GPT-SoVITS-based TTS and fast, edge-optimized inference; and the RAG Module with history-augmented retrieval and intent-based routing. Extensive experiments demonstrate improved realism (S-MOS), cross-lingual voice cloning, significant latency reductions (e.g., time to first audio playback reduced by 85%), and faster, more contextually accurate responses. The work enables practical immersive applications in education, communication, and entertainment via responsive digital humans.
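To make the retrieval ideas above concrete, the following is a minimal sketch of history-augmented retrieval combined with intent-based routing. It is an illustration under stated assumptions, not the paper's implementation: the function names (`augment_with_history`, `classify_intent`, `retrieve`, `answer_context`), the keyword-based intent check, and the word-overlap scoring are all hypothetical stand-ins for whatever classifiers and retrievers the system actually uses.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def augment_with_history(query, history, max_turns=2):
    """Prepend the most recent turns so follow-up questions keep their context."""
    return " ".join(history[-max_turns:] + [query])


def classify_intent(query):
    """Toy intent classifier (assumption): pure greetings count as chitchat."""
    greetings = {"hi", "hello", "hey", "thanks", "bye"}
    return "chitchat" if tokenize(query) <= greetings else "knowledge"


def retrieve(search_text, documents):
    """Toy retriever (assumption): pick the document with the largest word overlap."""
    query_tokens = tokenize(search_text)
    return max(documents, key=lambda doc: len(query_tokens & tokenize(doc)))


def answer_context(query, history, documents):
    """Route by intent: skip retrieval for chitchat, otherwise retrieve with history."""
    if classify_intent(query) == "chitchat":
        return None  # no knowledge lookup needed
    return retrieve(augment_with_history(query, history), documents)


documents = [
    "The cafeteria serves lunch from noon.",
    "The library opens at 9 am on weekdays.",
]
history = ["Tell me about the library."]
context = answer_context("What time does it close?", history, documents)
```

In this toy setup the follow-up question shares no words with either document, so only the earlier turn carried in by the history augmentation steers retrieval toward the library entry. Routing greetings away from retrieval altogether is one plausible way such a design could save time on turns that need no knowledge lookup.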
Abstract
High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.
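As an illustration of how an asynchronous execution pipeline can reduce perceived latency, the sketch below chains three concurrent stages through queues, so that speech synthesis can begin on the first sentence while later sentences are still being generated. This is a simplified sketch under assumptions, not the system described here: the stage functions (`generate_text`, `synthesize`, `animate_and_play`), the sentence-level granularity, and the simulated delays are all hypothetical placeholders for the real dialogue, speech, and avatar modules.

```python
import asyncio


async def generate_text(query, out_q):
    """Stand-in for a streaming dialogue model: emits one sentence at a time."""
    for sentence in [f"You asked: {query}.", "Here is a short answer.", "Anything else?"]:
        await asyncio.sleep(0.01)  # simulated generation delay
        await out_q.put(sentence)
    await out_q.put(None)  # sentinel: no more sentences


async def synthesize(in_q, out_q):
    """Stand-in for TTS: converts each sentence as soon as it arrives."""
    while (sentence := await in_q.get()) is not None:
        await asyncio.sleep(0.01)  # simulated synthesis delay
        await out_q.put(("audio", sentence))
    await out_q.put(None)  # propagate the sentinel downstream


async def animate_and_play(in_q, played):
    """Stand-in for avatar animation and playback: consumes audio chunks in order."""
    while (chunk := await in_q.get()) is not None:
        played.append(chunk[1])


async def run_pipeline(query):
    """Run all three stages concurrently, connected by queues."""
    text_q, audio_q = asyncio.Queue(), asyncio.Queue()
    played = []
    await asyncio.gather(
        generate_text(query, text_q),
        synthesize(text_q, audio_q),
        animate_and_play(audio_q, played),
    )
    return played


result = asyncio.run(run_pipeline("hi"))
```

Because each stage starts work as soon as its first input arrives, the time to first audio in a design like this depends on the first sentence alone rather than on the full response, which is the general effect an asynchronous pipeline aims for.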
