RITA: A Real-time Interactive Talking Avatars Framework
Wuxinlin Cheng, Cheng Wan, Yupeng Cao, Sihan Chen
TL;DR
RITA tackles latency in talking-avatar generation by decomposing the task into three phases: foundational frame generation, fast dynamic frame matching, and real-time video interpolation. It also integrates Large Language Models to drive context-aware dialogue. The method embeds audio into a hyperparameter space and reuses a library of frames, so that frames can be selected quickly via approximate nearest neighbor search. Real-time interpolation with techniques such as RIFE then restores smooth motion aligned with the audio, yielding natural lip-sync and expressive facial movements. Empirical results demonstrate faster generation and higher interaction quality than prior offline approaches, enabling practical applications in VR, online education, and interactive gaming.
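The frame-matching and interpolation phases summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `match_frame` and `blend_frames`, the use of cosine similarity, and the toy two-dimensional embeddings are all assumptions made for illustration. The search here is an exact brute-force scan standing in for an approximate nearest neighbor index, and the linear cross-fade is only a placeholder for a learned interpolator such as RIFE.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_frame(audio_embedding: Vector, library_keys: List[Vector]) -> int:
    """Return the index of the library frame whose key best matches the audio.

    Hypothetical helper: an exact linear scan, used here in place of an
    approximate nearest neighbor index.
    """
    if not library_keys:
        raise ValueError("frame library is empty")
    return max(
        range(len(library_keys)),
        key=lambda i: cosine_similarity(audio_embedding, library_keys[i]),
    )


def blend_frames(frame_a: Vector, frame_b: Vector, t: float) -> List[float]:
    """Linear cross-fade between two frames at position t in [0, 1].

    Placeholder only: a learned interpolator would estimate motion
    rather than averaging pixel values.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


# Toy library: each key is the (assumed) embedding associated with one frame.
library_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
first = match_frame([0.9, 0.1], library_keys)   # closest to key 0
second = match_frame([0.5, 0.6], library_keys)  # closest to key 2

# Midpoint between two tiny "frames" (flattened pixel values).
midpoint = blend_frames([0.0, 10.0], [10.0, 20.0], 0.5)
```

In a real system the linear scan would likely be replaced by an index built once over the frame library, so that lookup cost stays low as the library grows.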
Abstract
RITA presents a high-quality real-time interactive framework built upon generative models, designed with practical applications in mind. Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions. By leveraging the latest advancements in generative modeling, we have developed a versatile platform that not only enhances the user experience through dynamic conversational avatars but also opens new avenues for applications in virtual reality, online education, and interactive gaming. This work showcases the potential of integrating computer vision and natural language processing technologies to create immersive and interactive digital personas, pushing the boundaries of how we interact with digital content.
