
AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.


Paper Structure

This paper contains 33 sections, 1 equation, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Overview of our Interactive Video Stream Context Management mechanism. The framework uses a dual sliding-window strategy, where $N$ denotes the length of the video window, and $M$ denotes the number of recent QA groups retained in the interaction history outside the video window.
  • Figure 2: The figure illustrates three types of streaming QA interactions. Real-Time QA produces a single immediate response at the query time. Proactive QA produces a single delayed response after sufficient future evidence is observed. Multi-Response QA continuously tracks evolving events and produces multiple responses over time without requiring repeated queries.
  • Figure 3: Overview of the Coarse-to-Fine Streaming Data Engine in AURA. The pipeline comprises five stages: (1) Video Preparation, (2) QA Synthesis, (3) QA Refinement, (4) Streaming Structuring, and (5) Quality Verification.
  • Figure 4: Overview of AURA's end-to-end real-time inference system with video and speech input, multimodal inference, and speech output. The system is designed to support continual streaming perception and low-latency interaction.
  • Figure 5: Training data distribution. Left: QA-type distribution; right: video-domain distribution. The training set covers diverse question formats and video domains.
  • ...and 1 more figure
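
The dual sliding-window strategy from Figure 1 can be illustrated with a minimal sketch: one bounded buffer holds the last $N$ video frames, and a second bounded buffer retains the last $M$ QA groups that precede the video window. The class and method names below are hypothetical, not the authors' API; the paper does not specify the implementation.

```python
from collections import deque

class StreamingContext:
    """Hypothetical sketch of a dual sliding-window context.

    Keeps the most recent N video frames (the video window) and,
    separately, the most recent M question-answer groups retained
    outside the video window, as described for Figure 1.
    """

    def __init__(self, n_frames: int, m_qa_groups: int):
        self.frames = deque(maxlen=n_frames)          # video window of length N
        self.qa_history = deque(maxlen=m_qa_groups)   # last M QA groups

    def push_frame(self, frame):
        # Oldest frame is evicted automatically once the window is full.
        self.frames.append(frame)

    def push_qa(self, question: str, answer: str):
        # Oldest QA group is evicted once M groups are retained.
        self.qa_history.append((question, answer))

    def context(self):
        # A real system would interleave these as model tokens; here we
        # simply return the retained history and the current window.
        return list(self.qa_history), list(self.frames)
```

For example, with `n_frames=3` and `m_qa_groups=2`, pushing five frames leaves only the last three in the window, and pushing three QA groups retains only the last two.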