LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

Zhenyu Yang, Kairui Zhang, Yuhang Hu, Bing Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Weiming Dong, Changsheng Xu

TL;DR

LiveStar addresses the challenge of real-time online video understanding by enabling always-on, context-aware responses through adaptive streaming decoding. It introduces Streaming Causal Attention Masks (SCAM) for streaming video-language alignment, Streaming Verification Decoding (SVeD) for adaptive response timing, and Peak-End memory compression with a streaming key-value cache to handle long contexts efficiently. The OmniStar dataset provides diverse real-world scenarios and five online tasks to benchmark online video understanding. Experiments show state-of-the-art performance across three benchmarks, including substantial gains in semantic correctness and timing alignment while increasing inference speed. This work advances practical online video understanding by combining streaming-aware training, adaptive inference, and scalable evaluation.

Abstract

Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via single-forward-pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with a streaming key-value cache to achieve 1.53x faster inference. We also construct OmniStar, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.
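To make the response-silence idea in item (2) concrete, the following is a minimal sketch of a per-frame gate, not the authors' implementation. It assumes that verification means scoring the previous response against the current stream with one model call and comparing that score to a threshold; the abstract does not say which quantity is verified. All names here (`StreamState`, `response_silence_step`, `score_fn`, `generate_fn`, `threshold`) are hypothetical, and the toy functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Frames = Sequence[str]  # toy stand-in: each "frame" is just a scene label


@dataclass
class StreamState:
    """Holds the most recent response emitted for the stream."""
    last_response: Optional[str] = None


def response_silence_step(
    frames: Frames,
    state: StreamState,
    score_fn: Callable[[Frames, str], float],
    generate_fn: Callable[[Frames], str],
    threshold: float,
) -> Optional[str]:
    """Return a new response, or None to stay silent on this frame."""
    if state.last_response is not None:
        # Verification (assumed form): one scoring call on the previous
        # response. A higher score means it fits the current stream worse.
        if score_fn(frames, state.last_response) <= threshold:
            return None  # previous response still holds: stay silent
    # No previous response, or it failed verification: decode a new one.
    state.last_response = generate_fn(frames)
    return state.last_response


# Hypothetical toy stand-ins for the model calls.
def toy_generate(frames: Frames) -> str:
    return f"scene: {frames[-1]}"


def toy_score(frames: Frames, response: str) -> float:
    return 0.0 if response == toy_generate(frames) else 1.0


def run_stream(stream: Frames, threshold: float = 0.5) -> List[Optional[str]]:
    """Feed the stream one frame at a time and collect per-frame outputs."""
    state = StreamState()
    outputs: List[Optional[str]] = []
    for t in range(1, len(stream) + 1):
        outputs.append(
            response_silence_step(
                stream[:t], state, toy_score, toy_generate, threshold
            )
        )
    return outputs
```

In this toy setup, a lower `threshold` makes the assistant respond more often and a higher one keeps it silent longer. That is the kind of trade-off a response-silence threshold controls.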

Paper Structure

This paper contains 39 sections, 3 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of online video understanding. (a) Taking the RNG task as an example, online video understanding requires Video-LLMs to handle continuous streams and output at appropriate times; (b) Existing methods overly rely on learning the EOS token, leading to poor inference performance; (c)-(e) LiveStar establishes an effective response-silence training and inference framework by SCAM and SVeD without compromising basic video understanding capabilities.
  • Figure 2: Overview of the streaming verification decoding (SVeD) inference framework: A dynamic response-silence decoding framework designed to determine optimal response timing for online video understanding.
  • Figure 3: Mask matrix of SCAM.
  • Figure 4: Ablation study on the impact of response-silence threshold.
  • Figure 5: Comparison of VideoLLM-online, MMDuet, and LiveStar on the RNG task.
  • ...and 4 more figures