Artic: AI-oriented Real-time Communication for MLLM Video Assistant

Jiangkai Wu, Zhiyuan Ren, Junquan Zhong, Liming Liu, Xinggong Zhang

TL;DR

Artic addresses the mismatch between traditional RTC and AI Video Assistants by shifting the optimization objective to MLLM response accuracy and low latency. It introduces three core components: ReCapABR, which caps bitrate based on MLLM confidence to preserve bandwidth headroom; ZeCoStream, which allocates bitrate to conversation-relevant regions without extra overhead; and DeViBench, a benchmark that quantifies how RTC-induced degradation affects MLLM accuracy. Evaluations on real 5G uplink traces and trace-driven simulations show a 15.12% accuracy improvement and a 135.31 ms latency reduction, with manageable overhead. The work provides a practical framework and benchmark suite for aligning RTC with AI-driven perception tasks, offering a path toward robust, low-latency AI Video Assistants in mobile contexts.

Abstract

The AI Video Assistant is emerging as a new paradigm for Real-time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed in the cloud. This makes interaction between humans and AI more intuitive, akin to chatting with a real person. However, a fundamental mismatch exists between current RTC frameworks and AI Video Assistants, stemming from the drastic shift in Quality of Experience (QoE) and more challenging networks. Measurements on our production prototype also confirm that current RTC fails, causing latency spikes and accuracy drops. To address these challenges, we propose Artic, an AI-oriented RTC framework for MLLM Video Assistants, exploring the shift from "humans watching video" to "AI understanding video." Specifically, Artic proposes: (1) Response Capability-aware Adaptive Bitrate, which utilizes MLLM accuracy saturation to proactively cap bitrate, reserving bandwidth headroom to absorb future fluctuations and thereby reduce latency; (2) Zero-overhead Context-aware Streaming, which allocates limited bitrate to the regions most important for the response, maintaining accuracy even under ultra-low bitrates; and (3) Degraded Video Understanding Benchmark, the first benchmark evaluating how RTC-induced video degradation affects MLLM accuracy. Prototype experiments using real-world uplink traces show that, compared with existing methods, Artic improves accuracy by 15.12% and reduces latency by 135.31 ms. We will release the benchmark and code at https://github.com/pku-netvideo/DeViBench.
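
The idea of capping bitrate at the point where MLLM accuracy saturates can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the tolerance rule, and the accuracy profile below are all hypothetical, and the numbers are made up purely for illustration. The sketch assumes an offline-measured accuracy-versus-bitrate profile is available and picks the lowest bitrate whose accuracy is close to the best observed value.

```python
from typing import Sequence


def saturation_bitrate(bitrates_kbps: Sequence[float],
                       accuracies: Sequence[float],
                       tolerance: float = 0.02) -> float:
    """Return the lowest bitrate whose accuracy is within `tolerance` of the best.

    Hypothetical helper: the saturation rule is an assumption of this sketch.
    """
    if not bitrates_kbps or len(bitrates_kbps) != len(accuracies):
        raise ValueError("need two non-empty sequences of equal length")
    best = max(accuracies)
    # Every bitrate that is "good enough" relative to the best accuracy.
    good_enough = [b for b, a in zip(bitrates_kbps, accuracies)
                   if a >= best - tolerance]
    return min(good_enough)


def capped_bitrate(cc_estimate_kbps: float, cap_kbps: float) -> float:
    """Send at the congestion-control estimate, but never above the cap."""
    return min(cc_estimate_kbps, cap_kbps)


# Made-up profile: accuracy stops improving beyond roughly 800 kbps.
profile_bitrates = [200.0, 400.0, 800.0, 1600.0, 3200.0]
profile_accuracy = [0.55, 0.72, 0.84, 0.85, 0.85]

cap = saturation_bitrate(profile_bitrates, profile_accuracy)
print(cap)                           # cap chosen from the profile
print(capped_bitrate(2500.0, cap))   # ample bandwidth: the cap applies
print(capped_bitrate(500.0, cap))    # scarce bandwidth: follow the estimate
```

When bandwidth is plentiful, the difference between the congestion-control estimate and the cap is the headroom that can absorb a later bandwidth drop.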

Paper Structure

This paper contains 27 sections, 4 equations, 16 figures, and 2 tables.

Figures (16)

  • Figure 1: AI Video Assistant is a new paradigm for real-time communication. The user sends video and audio to the AI for thinking. The AI feeds back audio.
  • Figure 2: Latency measurement study. When bandwidth is sufficient, the video bitrate continuously increases, driven by congestion control (CC). However, under the influence of mobility and user behavior, bandwidth suffers sudden drops (e.g., at frame 525). The lag in bitrate reduction causes network congestion, resulting in severe latency spikes (e.g., 1,389 ms).
  • Figure 3: Accuracy measurement study. (a)(b) Accuracy exhibits saturation. The MLLM remains consistently accurate as the bitrate increases. (c) Low bitrates cause MLLM errors. (d) Errors persist even with the more powerful Pro model, demonstrating that errors stem from video degradation rather than model capability.
  • Figure 4: Artic overview
  • Figure 5: How to identify response-important regions? Modern MLLMs inherently possess real-time grounding capabilities, enabling them to accurately localize relevant regions (green box) based on the conversational context. Beyond single objects, this capability extends to multiple objects, scene-level contexts, and temporal contexts.
  • ...and 11 more figures
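
The latency mechanism described in the Figure 2 caption (bitrate reduction lagging behind a bandwidth drop) can be reproduced with a simple fluid-queue model. The sketch below is an illustration under stated assumptions, not the paper's measurement setup: the function name, the 100 ms step, and both traces are hypothetical, and the numbers are invented. It compares a sender that keeps its old bitrate for a few steps after the drop with a sender that was already capped below the reduced bandwidth.

```python
from typing import List, Sequence


def queueing_delay_ms(send_kbps: Sequence[float],
                      bandwidth_kbps: Sequence[float],
                      step_ms: float = 100.0) -> List[float]:
    """Queueing delay per step for a fluid FIFO bottleneck queue.

    Backlog grows when the send rate exceeds bandwidth and drains otherwise.
    Delay is the time needed to drain the backlog at the current bandwidth.
    """
    step_s = step_ms / 1000.0
    backlog_kbit = 0.0
    delays = []
    for send, bw in zip(send_kbps, bandwidth_kbps):
        if bw <= 0:
            raise ValueError("bandwidth must be positive")
        backlog_kbit = max(0.0, backlog_kbit + (send - bw) * step_s)
        delays.append(backlog_kbit / bw * 1000.0)
    return delays


# Invented trace: bandwidth falls from 3000 to 1000 kbps halfway through.
bandwidth = [3000.0] * 5 + [1000.0] * 5

# Uncapped sender: tracks the old bandwidth and reacts three steps late.
uncapped = [3000.0] * 8 + [800.0] * 2
# Capped sender: stays at a (hypothetical) 800 kbps cap throughout.
capped = [800.0] * 10

uncapped_delay = queueing_delay_ms(uncapped, bandwidth)
capped_delay = queueing_delay_ms(capped, bandwidth)

print(max(uncapped_delay))  # peaks at about 600 ms in this toy trace
print(max(capped_delay))    # no backlog builds up
```

In this toy trace the late-reacting sender accumulates backlog for three steps, while the capped sender never exceeds the reduced bandwidth and therefore sees no queueing delay.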