Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI

Jiangkai Wu, Zhiyuan Ren, Liming Liu, Xinggong Zhang

TL;DR

This work targets the latency bottleneck in AI Video Chat, where autoregressive MLLM inference dominates end-to-end delay. It proposes a shift in RTC objectives toward AI understanding of video and introduces Context-Aware Video Streaming that allocates bitrate to chat-important regions using CLIP-derived semantic correlations, enabling ultra-low bitrate operation with preserved MLLM accuracy. The authors also present DeViBench, the first benchmark specifically designed to measure how degraded video quality impacts MLLM understanding, generated via an automated QA-sample pipeline. Together, these contributions demonstrate substantial bitrate reductions with minimal accuracy loss and outline open questions for proactive context-awareness, long-term memory integration, and token-level optimization to advance practical AI video chat.

Abstract

AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.
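The idea of allocating bitrate by chat relevance can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each video region and the user's words have already been embedded into a shared space (for example by a CLIP-style model), and it uses hand-written placeholder vectors instead of real embeddings. The function names, the softmax weighting, the `temperature` parameter, and the per-region `floor_kbps` are all assumptions introduced for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def allocate_bitrate(text_emb, region_embs, total_kbps,
                     temperature=0.05, floor_kbps=1.0):
    """Split a bitrate budget across regions by relevance to the chat text.

    Hypothetical scheme: every region gets a small floor, and the rest of
    the budget is shared by a softmax over cosine similarities, so the
    most relevant regions receive almost all of it.
    """
    n = len(region_embs)
    if n == 0:
        return []
    if total_kbps < floor_kbps * n:
        raise ValueError("budget is smaller than the per-region floor")

    sims = [cosine(text_emb, r) for r in region_embs]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total_weight = sum(weights)

    spare = total_kbps - floor_kbps * n
    return [floor_kbps + spare * w / total_weight for w in weights]


# Placeholder embeddings (assumed, not produced by a real model).
user_words = [1.0, 0.0, 0.0]
regions = [
    [0.9, 0.1, 0.0],  # region closely related to the user's words
    [0.0, 1.0, 0.0],  # unrelated background
    [0.0, 0.0, 1.0],  # unrelated background
    [0.1, 0.9, 0.1],  # weakly related
]
allocation = allocate_bitrate(user_words, regions, total_kbps=200.0)
```

A low `temperature` pushes the allocation toward the "almost exclusively" behavior described in the abstract, while the floor keeps every region decodable. A real system would map the per-region budget onto encoder controls such as per-block quantization, which this sketch does not attempt.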

Paper Structure

This paper contains 9 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: AI Video Chat is a new paradigm for real-time communication. The user sends video and audio to the AI for thinking. The AI feeds back to the user. Low latency is crucial for making AI act like a real person.
  • Figure 2: The MLLM processes video at a very low frame rate (green), so most frames are redundant (red).
  • Figure 3: How bitrate and packet loss affect latency (with 10 Mbps bandwidth). To optimize video quality, traditional RTC systems select bitrate from the gray region. In AI Video Chat, where the goal is to maintain MLLM accuracy, we only need to select bitrate from the yellow region.
  • Figure 4: Why video should be context-aware in AI Video Chat. In the first dialogue, even if the video bitrate decreases from 4000 Kbps to 200 Kbps, the MLLM can still respond accurately. But in the second dialogue, taken from StreamingBench (Lin et al., 2024), the blurry video leads to incorrect responses. Thus, rather than reducing bitrate in a context-agnostic manner, bitrate allocation should be determined by the current chat context.
  • Figure 5: How to achieve context awareness? The user's words can indicate which regions in the video are important for the current chat context. Based on CLIP, we can even recognize important regions through high-level understanding. For example, in the third dialogue, the growth of grass implies the current season.
  • ...and 5 more figures