Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans

Hongbin Huang, Junwei Li, Tianxin Xie, Zhuang Li, Cekai Weng, Yaodong Yang, Yue Luo, Li Liu, Jing Tang, Zhijing Shao, Zeyu Wang

TL;DR

The paper addresses the challenge of building high-fidelity digital humans that are both visually realistic and responsive in real time. It proposes Hi-Reco, a modular system integrating a photorealistic 3D avatar, persona-driven speech synthesis, and knowledge-grounded dialogue via a retrieval-augmented generation framework, all coordinated by an asynchronous pipeline to minimize latency. Key contributions include the 3D Avatar Module leveraging DEGAS and Imitator for real-time emotion-matched facial animation; the Speech Module with GPT-SoVITS-based TTS and fast, edge-optimized inference; and the RAG Module with history-augmented retrieval and intent-based routing. Extensive experiments demonstrate improved realism (S-MOS), cross-lingual voice cloning, significant latency reductions (e.g., time to first audio playback reduced by 85%), and faster, more contextually accurate responses. The work enables practical immersive applications in education, communication, and entertainment via responsive digital humans.

Abstract

High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.

Paper Structure

This paper contains 16 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: System architecture of our digital human framework.
  • Figure 2: An example of semantic embedding-based motion selection.
  • Figure 3: Comparison between segmented and non-segmented audio processing pipelines.
  • Figure 4: Rendering results of the 3D digital human avatar from four canonical views and diverse 3D digital humans.