video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory

Guangzhi Sun; Yixuan Li; Xiaodong Wu; Yudong Yang; Wei Li; Zejun Ma; Chao Zhang

TL;DR

The paper tackles scalable, real-time understanding of long video streams under a fixed memory budget. It introduces video-SALMONN S, a streaming audio-visual LLM that combines a test-time-training memory module optimised with a Hessian-free procedure (TTT_HF) and a prompt-dependent memory reader, preserving long-range context without unbounded memory growth. On Video-MME, LVBench, and VideoEvalPro, the 7B/8B models with audio-visual inputs achieve state-of-the-art performance; the 8B model reaches 74.2% overall and 67.8% on the long split of Video-MME, while processing over 3 hours of video at 1 FPS and 360p. The key innovations are the TTT_HF memory update and the prompt-driven KV-cache reading, which together enable high-fidelity long-form video understanding within a fixed memory budget. The work makes streaming reasoning over long-form video more practical by reducing information loss and allowing scalable deployment.

Abstract

Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline methods that use a fixed number of frames need to know the stream length in order to adapt the frame rate, while streaming methods bound memory by merging or discarding tokens, which loses information. We propose video-SALMONN S, a streaming audio-visual LLM that, to our knowledge, is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. Our model introduces (i) a test-time-training (TTT) memory module that replaces token merging and continually updates token representations to capture long-range dependencies, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from the fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT_HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens. Our 8B-parameter model achieves 74.2% overall and 67.8% on the Video-MME long split, outperforming both offline and streaming baselines.
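
To make the phrase "Hessian-free conjugate-gradient procedure" concrete, the following is a minimal sketch of a generic Hessian-free Newton step, not the authors' TTT_HF implementation. It assumes a toy memory represented by a weight vector `w` that is fitted to a least-squares reconstruction loss; the names `hvp`, `hessian_free_step`, `K`, and `v`, the damping term, and the finite-difference Hessian-vector product are all illustrative assumptions (an autodiff framework would normally supply the Hessian-vector product).

```python
import numpy as np


def hvp(grad_fn, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H @ v by a central
    finite difference of gradients; H itself is never formed."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)


def hessian_free_step(grad_fn, w, n_iters=10, damping=1e-3, tol=1e-10):
    """One damped Newton step: solve (H + damping * I) d = -g with
    conjugate gradient, using only Hessian-vector products."""
    g = grad_fn(w)
    d = np.zeros_like(w)   # current estimate of the update direction
    r = -g                 # residual of the linear system at d = 0
    p = r.copy()           # first search direction
    rs = r @ r
    for _ in range(n_iters):
        if rs < tol:       # residual is small enough, stop early
            break
        Ap = hvp(grad_fn, w, p) + damping * p
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w + d


# Toy usage (hypothetical data): fit memory weights w so that K @ w
# reconstructs v, with loss 0.5 * mean((K @ w - v) ** 2).
rng = np.random.default_rng(0)
K = rng.normal(size=(32, 4))
v = K @ rng.normal(size=4)
grad_fn = lambda w: K.T @ (K @ w - v) / len(v)
w_new = hessian_free_step(grad_fn, np.zeros(4), n_iters=8, damping=0.0)
```

The point of the design is that second-order curvature information is used through products `H @ p` only, so the cost per conjugate-gradient iteration stays close to that of a gradient evaluation rather than requiring an explicit Hessian.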
