Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI
Jiangkai Wu, Zhiyuan Ren, Liming Liu, Xinggong Zhang
TL;DR
This work targets the latency bottleneck in AI Video Chat, where autoregressive MLLM inference dominates end-to-end delay. It proposes a shift in RTC objectives toward AI understanding of video and introduces Context-Aware Video Streaming that allocates bitrate to chat-important regions using CLIP-derived semantic correlations, enabling ultra-low bitrate operation with preserved MLLM accuracy. The authors also present DeViBench, the first benchmark specifically designed to measure how degraded video quality impacts MLLM understanding, generated via an automated QA-sample pipeline. Together, these contributions demonstrate substantial bitrate reductions with minimal accuracy loss and outline open questions for proactive context-awareness, long-term memory integration, and token-level optimization to advance practical AI video chat.
Abstract
AI Video Chat is emerging as a new paradigm for Real-time Communication (RTC), in which one peer is not a human but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, it also places severe demands on latency: MLLM inference takes up most of the response time, leaving very little time for video streaming. Because of network uncertainty, transmission latency becomes a critical bottleneck that prevents the AI from behaving like a real person. To address this, we call for AI-oriented RTC research that explores the shift in network requirements from "humans watching video" to "AI understanding video". We begin by identifying the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we find that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming, which recognizes the importance of each video region for the chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark for this purpose, the Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.
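
To make the idea of allocating bitrate "almost exclusively to chat-important regions" concrete, the following is a minimal sketch and not the authors' implementation. It assumes each video region already has a relevance score with respect to the current chat (for example, a similarity between the user's words and the region, as CLIP-style models can provide). The function name `allocate_bitrate`, the softmax weighting, the `temperature` value, and the per-region floor are all hypothetical choices made for illustration.

```python
import math


def allocate_bitrate(relevance, total_kbps, floor_kbps=1.0, temperature=0.05):
    """Split a bitrate budget across regions, favoring chat-relevant ones.

    relevance   -- one score per region (assumed input, higher = more relevant)
    total_kbps  -- total bitrate budget for the frame
    floor_kbps  -- minimum given to every region (illustrative assumption)
    temperature -- lower values concentrate the budget on the top regions
    """
    if not relevance:
        raise ValueError("need at least one region")
    if total_kbps < floor_kbps * len(relevance):
        raise ValueError("budget is too small to give every region its floor")

    # Numerically stable softmax over the relevance scores.
    peak = max(relevance)
    weights = [math.exp((r - peak) / temperature) for r in relevance]
    norm = sum(weights)

    # Every region gets the floor; the rest is shared by softmax weight.
    spare = total_kbps - floor_kbps * len(relevance)
    return [floor_kbps + spare * w / norm for w in weights]


# Hypothetical scores for four regions; the first is the one being discussed.
scores = [0.31, 0.12, 0.10, 0.09]
plan = allocate_bitrate(scores, total_kbps=200.0)
for region, kbps in enumerate(plan):
    print(f"region {region}: {kbps:.1f} kbps")
```

With a low temperature, most of the budget goes to the highest-scoring region, while the floor keeps the remaining regions decodable at very low quality. A real encoder would map such a plan to per-region quantization parameters rather than to explicit bitrates.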
