Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian

TL;DR

Proact-VL is a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. It achieves superior response latency and quality while maintaining strong video understanding capabilities.

Abstract

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

Paper Structure

This paper contains 77 sections, 16 equations, 16 figures, 19 tables.

Figures (16)

  • Figure 1: Overview of Proact-VL. The top section shows Proact-VL collaborating with other commentators for real-time commentary, while the bottom section highlights its proactive player guidance capability.
  • Figure 2: Overview of the Live Gaming Dataset. The inner, middle, and outer rings represent the three data categories, 12 specific game titles, and their corresponding genres, respectively.
  • Figure 3: Overview of data pipeline.
  • Figure 4: Illustration of Proact-VL. At each second, Proact-VL consumes multi-source tokens (video, query, and context) and decides whether to speak by feeding the FLAG hidden state into a response head to obtain a score, then thresholding it with $\tau$. If triggered, it appends the assistant prefix and generates a short clip-level text; otherwise, it appends the prefix with a Silence token to output silence.
  • Figure 5: Score curve visualization. Green: labeled response; Red: labeled silence; Dashed line: threshold; Above-threshold scores: model triggers responses.
  • ...and 11 more figures
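
The per-second speak-or-stay-silent decision described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear-plus-sigmoid form of the response head, the strict `score > tau` comparison, the `SILENCE_TOKEN` string, and all function names are assumptions introduced here for clarity.

```python
import math
from typing import Callable, Sequence

# Hypothetical placeholder; the paper's actual Silence token string is not given here.
SILENCE_TOKEN = "<silence>"


def response_score(flag_hidden: Sequence[float],
                   weights: Sequence[float],
                   bias: float) -> float:
    """Map the FLAG hidden state to a score in (0, 1).

    Assumption: the response head is a single linear layer followed by a
    sigmoid. The caption only says the hidden state is fed into a
    "response head" to obtain a score.
    """
    logit = sum(h * w for h, w in zip(flag_hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def step(flag_hidden: Sequence[float],
         weights: Sequence[float],
         bias: float,
         tau: float,
         generate: Callable[[], str]) -> str:
    """Run one decision step for a one-second clip.

    If the score exceeds the threshold tau, call `generate` (a stand-in for
    decoding a short clip-level response after the assistant prefix);
    otherwise emit the Silence token.
    """
    score = response_score(flag_hidden, weights, bias)
    if score > tau:  # assumed strict comparison
        return generate()
    return SILENCE_TOKEN


if __name__ == "__main__":
    w, b, tau = [1.0, 0.0], 0.0, 0.5
    print(step([2.0, 0.3], w, b, tau, lambda: "Nice dodge!"))   # score above tau
    print(step([-2.0, 0.3], w, b, tau, lambda: "Nice dodge!"))  # score below tau
```

Raising `tau` makes the agent speak less often and lowering it makes the agent more talkative, which is the trade-off visualized by the score curves in Figure 5.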