video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory

Guangzhi Sun, Yixuan Li, Xiaodong Wu, Yudong Yang, Wei Li, Zejun Ma, Chao Zhang

TL;DR

The paper tackles scalable, real-time understanding of long video streams under a fixed memory budget. It introduces video-SALMONN S, a streaming audio-visual LLM that combines a Hessian-free test-time-training memory module (TTT_HF) with a prompt-dependent memory reader to preserve long-range context without unbounded memory growth. On Video-MME, LVBench, and VideoEvalPro, the 7B/8B models with audio-visual inputs achieve state-of-the-art performance, notably 74.2% overall and 67.8% on the long split of Video-MME with the 8B model, while processing videos of over 3 hours at 1 FPS and 360p. The key innovations are the TTT_HF memory update and the prompt-driven KV-cache reading, which together enable high-fidelity long-form video understanding within a fixed memory budget. Compared with streaming methods that merge or discard tokens, this reduces information loss on long-form content.

Abstract

Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline methods that use a fixed number of frames need to know the stream length in order to adapt the frame rate; streaming methods constrain memory by merging or discarding tokens, losing information. We propose video-SALMONN S, a streaming audio-visual LLM that, to our knowledge, is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. Our model introduces (i) a test-time-training (TTT) memory module that replaces token merging and continually updates token representations to capture long-range dependencies, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT_HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens. Our 8B-parameter model achieves 74.2% overall and 67.8% on the Video-MME long split, outperforming both offline and streaming baselines.
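As a rough check on the scale quoted in the abstract, the short calculation below converts "3 hours at 1 FPS" into a frame count and divides the quoted 1M tokens by it. The function name `stream_budget` is hypothetical, and the per-frame figure is only indicative: "10k frames" and "1M tokens" are round numbers, and the abstract does not say how the tokens split between audio and video.

```python
def stream_budget(hours, fps, total_tokens):
    """Return (number of frames, average tokens per frame) for a stream."""
    frames = int(hours * 3600 * fps)
    return frames, total_tokens / frames


# Figures taken from the abstract: 3 hours, 1 FPS, about 1M tokens in total.
frames, tokens_per_frame = stream_budget(hours=3, fps=1, total_tokens=1_000_000)
print(frames, round(tokens_per_frame, 1))  # prints: 10800 92.6
```

The 10,800 frames agree with the abstract's "10k frames", and the stream grows by roughly a hundred tokens per second of video, which is why a memory that does not grow with stream length is needed.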

Paper Structure

This paper contains 30 sections, 15 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overall model structure of video-SALMONN S. The video encodings are first passed through the TTT$_\text{HF}$ layer (Section \ref{sec:ttt}), followed by a similarity discarding procedure that keeps the memory at a fixed size (Section \ref{sec:mem}). The fixed memory is then used as the input to the LLM, optionally using a prompt-dependent reading mechanism (Section \ref{sec:pd}). Audio tokens bypass the TTT$_\text{HF}$ layer.
  • Figure 2: The overall workflow of the TTT$_\text{HF}$ layer. The layer works as an RNN model, which updates the current fast weight $\bm W_{t-1}$ of an MLP model for an incoming mini-batch of tokens $\bm X_t$ to minimise a reconstruction loss. The Hessian-free method is used to construct the update. This MLP model is then used to generate an output token $\bm Z_t$. The figure is adapted from ttt-rnn.
  • Figure 3: The reconstruction loss (Eqn. \ref{eqn:recon_L}) of a single mini-batch sample w.r.t. the update norm $\|\Delta \bm W_t\|_2$. Samples $\bm X_t$ (at indices 10, 40, 70, and 100) are from the same input sequence. Projection matrices $\bm\theta_K, \bm\theta_V$ are extracted from a trained TTT layer with standard SGD updates. The updates generated by the SGD baseline and by the HF method with curvature matrices $\mathbf{B}_\text{MLP}$ (B-MLP) and $\mathbf{B}_\text{LN}$ (B-LN), defined in Eqn. (\ref{eqn:B}), are compared for 2, 3, 4, and 5 CG iterations.
  • Figure 4: Ablation studies on (a) the influence of the maximum number of frames on two extremely long video benchmarks (left: LVBench, right: VideoEvalPro), and (b) the influence of memory size on video-SALMONN S. When the memory size exceeds 32k, prompt-dependent reading is used. Baseline refers to similarity merging without prompt-dependent reading.
  • Figure 5: TTT statistics of a one-hour video during inference. The sequence is run with the checkpoint trained with TTT$_\text{Muon}$, which has a learnt per-time-step update norm of 0.1386. To enable a fair comparison, the sequence is re-run through the TTT layer with the optimiser replaced by HF and by SGD, while enforcing the same per-time-step update norm. The left panel shows the reconstruction loss $\mathcal{L}(\bm X_t,\bm \eta_t; \bm W_{t-1})$ and the right panel shows the TTT relative output change $\frac{\|\Delta f(\bm\theta_Q\bm X_t; \bm W_{t-1})\|}{\|f(\bm\theta_Q\bm X_t; \bm W_{t-1})\|}$. Although TTT$_\text{Muon}$ achieves the lowest reconstruction loss (by a small margin), the change each of its updates makes to the output of the TTT layer is significantly smaller. This suggests that less information may be incorporated into the output of the TTT$_\text{Muon}$ layer, which makes the learning task of the downstream LLM more difficult.
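
The pipeline described in the captions of Figures 1 and 2 can be illustrated with a small NumPy sketch. It is not the authors' implementation, and it rests on several assumptions made only for illustration: the fast-weight model is linear, $f(\bm k; \bm W) = \bm W \bm k$, rather than an MLP; the reconstruction loss is taken to be $\tfrac{1}{2}\|\bm W \bm\theta_K \bm X_t - \bm\theta_V \bm X_t\|^2$; the curvature matrix is a damped Gauss-Newton matrix rather than $\mathbf{B}_\text{MLP}$ or $\mathbf{B}_\text{LN}$; and the discarding rule (drop the token most similar to its predecessor) is a guess at what "similarity discarding" might look like. All function names are hypothetical.

```python
import numpy as np


def cg_solve(apply_A, B, iters, tol=1e-12):
    """Approximately solve A(X) = B with a few conjugate-gradient iterations.

    apply_A must be a symmetric positive-definite linear map on matrices;
    inner products are Frobenius inner products.
    """
    X = np.zeros_like(B)
    R = B.copy()                      # residual B - A(X) at X = 0
    P = R.copy()                      # first search direction
    rs = float(np.sum(R * R))
    for _ in range(iters):
        if rs < tol:                  # already converged
            break
        AP = apply_A(P)
        alpha = rs / float(np.sum(P * AP))
        X += alpha * P
        R -= alpha * AP
        rs_new = float(np.sum(R * R))
        P = R + (rs_new / rs) * P
        rs = rs_new
    return X


def recon_loss(W, K, V):
    """Assumed reconstruction loss: 0.5 * ||W K - V||^2."""
    return 0.5 * float(np.sum((W @ K - V) ** 2))


def ttt_hf_step(W, X, theta_K, theta_V, theta_Q, damping=1e-2, cg_iters=3):
    """One recurrent step: update the fast weight on X, then emit outputs."""
    K, V, Q = theta_K @ X, theta_V @ X, theta_Q @ X
    G = (W @ K - V) @ K.T             # gradient of the loss w.r.t. W
    # Curvature-matrix product; the curvature matrix itself is never formed.
    curv = lambda D: (D @ K) @ K.T + damping * D
    delta = cg_solve(curv, -G, cg_iters)
    W_new = W + delta
    return W_new, W_new @ Q           # updated fast weight and output tokens


def discard_to_budget(memory, budget):
    """Drop the token most similar to its predecessor until the budget is met."""
    memory = np.asarray(memory, dtype=float)
    while memory.shape[0] > budget:
        unit = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-12)
        sim = np.sum(unit[1:] * unit[:-1], axis=1)   # cosine sim of token i+1 to i
        memory = np.delete(memory, int(np.argmax(sim)) + 1, axis=0)
    return memory


rng = np.random.default_rng(0)
d, b = 8, 4                           # feature size, tokens per mini-batch
theta_K, theta_V, theta_Q = (
    rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)
)
W = np.zeros((d, d))
memory = np.empty((0, d))
losses = []
for t in range(6):                    # six mini-batches of a toy stream
    X = rng.standard_normal((d, b))
    before = recon_loss(W, theta_K @ X, theta_V @ X)
    W, Z = ttt_hf_step(W, X, theta_K, theta_V, theta_Q)
    after = recon_loss(W, theta_K @ X, theta_V @ X)
    losses.append((before, after))
    memory = discard_to_budget(np.vstack([memory, Z.T]), budget=10)
```

In this toy setting each truncated CG update lowers the reconstruction loss on the current mini-batch, and the memory stays at 10 tokens however many mini-batches arrive. With an MLP fast-weight model, as in the paper, the curvature product would come from the choices in Eqn. (\ref{eqn:B}) rather than from the closed form used here.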