RITA: A Real-time Interactive Talking Avatars Framework

Wuxinlin Cheng, Cheng Wan, Yupeng Cao, Sihan Chen

TL;DR

RITA tackles latency in talking-avatar generation by decomposing the task into three phases: foundational frame generation, fast dynamic frame matching, and real-time video interpolation. Large Language Models are integrated to drive context-aware dialogue. The method embeds audio into an embedding space and reuses a pre-generated library of frames, so that frames can be selected quickly via approximate nearest neighbor search. Real-time interpolation with methods such as RIFE then restores smooth motion aligned with the audio, yielding natural lip-sync and expressive facial movements. Empirical results show faster generation and higher interaction quality than prior offline approaches, which makes the framework practical for VR, online education, and interactive gaming.
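
The matching and interpolation phases described above can be illustrated with a small sketch. It is not the authors' implementation: the function names (`match_frame`, `blend`, `stream_frames`), the use of cosine similarity, and the toy frame representation (flat lists of floats) are all assumptions made for illustration. An exact brute-force search stands in for approximate nearest neighbor search, and a linear cross-fade stands in for a learned interpolator such as RIFE.

```python
import math
from typing import Iterable, Iterator, List, Sequence

Vector = Sequence[float]
Frame = List[float]  # toy stand-in for an image; real frames would be pixel arrays


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_frame(query: Vector, library_keys: Sequence[Vector]) -> int:
    """Return the index of the library key most similar to the query.

    Assumption: exact brute-force search, used here only as a simple
    stand-in for an approximate nearest neighbor index.
    """
    if not library_keys:
        raise ValueError("frame library is empty")
    return max(
        range(len(library_keys)),
        key=lambda i: cosine_similarity(query, library_keys[i]),
    )


def blend(frame_a: Frame, frame_b: Frame, t: float) -> Frame:
    """Linear cross-fade between two frames (stand-in for RIFE-style interpolation)."""
    return [(1.0 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


def stream_frames(
    audio_embeddings: Iterable[Vector],
    library_keys: Sequence[Vector],
    library_frames: Sequence[Frame],
    steps_between: int = 1,
) -> Iterator[Frame]:
    """Yield matched frames one at a time, with blended frames in between.

    Because this is a generator, a consumer can display each frame as soon
    as it is produced instead of waiting for the whole clip.
    """
    previous = None
    for embedding in audio_embeddings:
        current = library_frames[match_frame(embedding, library_keys)]
        if previous is not None:
            for step in range(1, steps_between + 1):
                yield blend(previous, current, step / (steps_between + 1))
        yield current
        previous = current
```

In this sketch the frame library (`library_keys` and `library_frames`) would be built once, offline, in the foundational frame generation phase, so only the lookup and blending run at inference time.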

Abstract

RITA presents a high-quality real-time interactive framework built upon generative models, designed with practical applications in mind. Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions. By leveraging the latest advancements in generative modeling, we have developed a versatile platform that not only enhances the user experience through dynamic conversational avatars but also opens new avenues for applications in virtual reality, online education, and interactive gaming. This work showcases the potential of integrating computer vision and natural language processing technologies to create immersive and interactive digital personas, pushing the boundaries of how we interact with digital content.

Paper Structure (13 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of RITA. Green arrows indicate foundational frame generation, which is not required during real-time inference. Real-time inference requires only the bottom blue arrows.
  • Figure 2: Frame comparison between a non-real-time avatar generation model and RITA. Keyframe times are 2, 4, 6, and 8 seconds.
  • Figure 3: Runtime comparison between RITA and SadTalker. Frames generated by RITA can be accessed while generation is still in progress, so users experience almost no waiting time.

Theorems & Definitions (1)

  • Definition 3.1