Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Weicai Yan; Yuhong Dai; Qi Ran; Haodong Li; Wang Lin; Hao Liao; Xing Xie; Tao Jin; Jianxun Lian

TL;DR

Proact-VL is a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. It achieves superior response latency and quality while maintaining strong video understanding capabilities.

Abstract

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

Paper Structure (77 sections, 16 equations, 16 figures, 19 tables)

Introduction
Related Work
Large Multimodal Models
Streaming and Proactive Video Understanding
The Live Gaming Dataset and Benchmark
Video Data Collection
Data Processing
Commentary Data Processing
Guide Data Processing
Persona Enrichment
Benchmark Construction
Methodology
Chunk-Wise Input Schema
Proactive Response Mechanism
Training Strategy
...and 62 more sections

Figures (16)

Figure 1: Overview of Proact-VL. The top section shows Proact-VL collaborating with other commentators for real-time commentary, while the bottom section highlights its proactive player guidance capability.
Figure 2: Overview of the Live Gaming Dataset. The inner, middle, and outer rings represent the three data categories, 12 specific game titles, and their corresponding genres, respectively.
Figure 3: Overview of data pipeline.
Figure 4: Illustration of Proact-VL. At each second, Proact-VL consumes multi-source tokens (video, query, and context) and decides whether to speak by feeding the FLAG hidden state into a response head to obtain a score, then thresholding with $\tau$. If triggered, it appends the assistant prefix and generates a short clip-level text; otherwise, it appends the prefix with a Silence token to output silence.
Figure 5: Score curve visualization. Green: labeled response; Red: labeled silence; Dashed line: threshold; Above-threshold scores: model triggers responses.
...and 11 more figures
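
The speak-or-stay-silent decision described in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. This is not the authors' implementation: the linear-plus-sigmoid form of the response head and the names `response_score`, `should_speak`, and `gate_stream` are assumptions made purely for illustration. Only the idea of scoring the FLAG hidden state and comparing the score against a threshold $\tau$ comes from the captions.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def response_score(
    flag_hidden: Sequence[float],
    head_weights: Sequence[float],
    head_bias: float = 0.0,
) -> float:
    """Map a FLAG hidden state to a score in [0, 1].

    ASSUMPTION: the response head is modeled here as a single linear
    layer followed by a sigmoid; the captions do not specify its form.
    """
    logit = sum(h * w for h, w in zip(flag_hidden, head_weights)) + head_bias
    return sigmoid(logit)


def should_speak(score: float, tau: float) -> bool:
    """Trigger a response only when the score is above the threshold tau."""
    return score > tau


def gate_stream(scores: Sequence[float], tau: float) -> List[str]:
    """Label each one-second step as 'speak' or 'silence'."""
    return ["speak" if should_speak(s, tau) else "silence" for s in scores]
```

For example, `gate_stream([0.2, 0.7, 0.5], tau=0.5)` labels only the second step as `"speak"`. In the setup the captions describe, a triggered step would be followed by generating a short clip-level text, and a non-triggered step by emitting the Silence token; both are omitted from this sketch.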
