Make a Video Call with LLM: A Measurement Campaign over Five Mainstream Apps

Jiayang Xu, Xiangjie Huang, Zijie Li, Zili Meng

TL;DR

This work presents the first systematic benchmark for AI video chat, evaluating five mainstream apps across four dimensions (quality, latency, internal mechanisms, and system overhead) using cloud and local testbeds over multiple regions and days. It reveals a persistent trade-off between response quality and latency, highlights differences in input modalities (end-to-end vs. cascaded) and network protocols (RTP/RTCP vs. QUIC), and identifies bottlenecks such as memory behavior, scheduling, and bandwidth sensitivity. The study provides a real-world baseline, curated datasets (including streaming-focused and AI-RTC task datasets), and repeatable testbeds to support future optimization of AI video chat systems and to inform designers of system-level trade-offs. Overall, the results emphasize practical challenges in achieving seamless, proactive AI video conversations and offer concrete directions for improving memory management, protocol choices, and resource efficiency in RTC-based AI assistants.

Abstract

In 2025, Large Language Model (LLM) services launched a new feature, AI video chat, which allows users to interact with AI agents via real-time video communication (RTC), just like chatting with real people. Despite its significance, no systematic study has characterized the performance of existing AI video chat systems. To address this gap, this paper proposes a comprehensive benchmark with carefully designed metrics across four dimensions: quality, latency, internal mechanisms, and system overhead. Using custom testbeds, we then evaluate five mainstream AI video chatbots with this benchmark. This work provides the research community with a baseline of real-world performance and identifies unique system bottlenecks. Our benchmarking results also open up several research questions for future optimization of AI video chatbots.

Paper Structure

This paper contains 25 sections, 19 figures, 7 tables.

Figures (19)

  • Figure 1: AI video chat paradigm
  • Figure 2: AI video chat differs from related applications
  • Figure 3: Evaluation objectives for AI video chat
  • Figure 4: The definition of AI visual content memory
  • Figure 5: Overview of cloud and local testbeds
  • ...and 14 more figures