Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li

TL;DR

Sparrow is a speculative decoding framework for Video LLMs. It uses visually-aware text-anchored window attention with hidden state reuse to fully offload visual computation to the target model, and it uses intermediate-layer visual state bridging to train the draft model on semantically rich intermediate states, thereby filtering out low-level visual noise.

Abstract

Although speculative decoding is widely used to accelerate Vision-Language Model (VLM) inference, its performance collapses severely when applied to Video Large Language Models (Vid-LLMs). The draft model typically suffers from attention dilution and negative visual gain caused by key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs: critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first uses visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and then leverages intermediate-layer visual state bridging to train the draft model with semantically rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation on long sequences and offering a practical solution for real-time long video tasks.
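
For readers unfamiliar with the draft-then-verify loop that the abstract builds on, the following is a minimal sketch of one verification step in generic speculative decoding. It assumes greedy decoding with exact-match acceptance; the function names and the token-list interface are hypothetical and are not taken from Sparrow. The number of accepted draft tokens per step is what the figures refer to as accepted length.

```python
from typing import List, Sequence


def accepted_length(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target model's greedy choices."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n


def verify_step(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy assumption).

    target_tokens[i] is the target model's greedy token at the position where the
    draft proposed draft_tokens[i]. It must hold one extra entry: the token the
    target predicts after the last draft position.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")
    k = accepted_length(draft_tokens, target_tokens)
    # Keep the accepted prefix, then append the target's own token: either the
    # correction at the first mismatch or the extra token after a full match.
    return list(draft_tokens[:k]) + [target_tokens[k]]
```

Under this assumption every step emits at least one token, so a draft model that is rarely accepted yields little speedup while still paying its own drafting latency.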

Paper Structure (30 sections, 4 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Analysis of the impact of visual token length on performance. (a) Comparison of average accepted length and drafting latency between MSD and ViSpec on the VideoDetailCaption dataset across varying visual sequence lengths. (b) Performance of MSD on image and video tasks at different visual retention rates. The Last Instr. strategy retains the top-x% of visual tokens ranked by the attention scores of the final instruction token, while the All Text strategy ranks them by the average attention scores of all text tokens towards visual regions. ViSpec and MSD use Qwen2.5-VL-7B and Qwen2-VL-7B as target models, respectively, and both use the officially released weights for their draft models.
  • Figure 2: (a) & (b) Visualization of the average attention weight distribution from the last instruction token to preceding Instruction, Visual, and Text tokens in short (0.4k) and long (3k) sequence tasks.
  • Figure 3: Layer-wise Importance Analysis of Visual Tokens in Qwen2.5-VL-7B. (a) Comparison of task accuracy (solid lines) after removing visual tokens starting from layer x versus the native baseline (dashed lines). (b) Sum of attention received by all visual tokens from the last instruction token across different attention heads and model layers.
  • Figure 4: Illustration of the Sparrow framework. The left and right panels illustrate the training and inference phases, respectively. Initially, the target model performs multi-stage fusion on visual embeddings $v_i$ and text embeddings $e_i$, yielding a noise-filtered visual hidden state $h^m_{v_i}$ and a visually-infused text hidden state $h^h_{e_i}$. In the training phase, $h^m_{v_i}$ and $h^h_{e_i}$ are concatenated and fed into the draft model to produce $h_{e_i}$. Subsequently, $h^m_{v_i}$ is concatenated with $h_{e_i}$ and re-input into the draft model. Losses from both stages are jointly computed to mitigate the discrepancy between training and inference. In the inference phase, the draft model directly takes $h^h_{e_i}$ as input.
  • Figure 5: Layer-wise importance analysis of visual tokens in LLaVA-OneVision-7B. The figure compares task accuracy after removing visual tokens starting from layer x (solid lines) with the native baseline performance (dashed lines).
  • ...and 1 more figures
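
The two retention strategies named in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only: the attention matrix layout (rows are query tokens, columns are key tokens), the rounding rule for x%, and all function names are assumptions rather than details given in the source.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def last_instr_scores(attn: Matrix, visual_idx: Sequence[int], last_instr_row: int) -> List[float]:
    """Score each visual token by the attention it receives from the final instruction token."""
    return [attn[last_instr_row][j] for j in visual_idx]


def all_text_scores(attn: Matrix, visual_idx: Sequence[int], text_rows: Sequence[int]) -> List[float]:
    """Score each visual token by the mean attention it receives from all text tokens."""
    return [sum(attn[r][j] for r in text_rows) / len(text_rows) for j in visual_idx]


def retain_top_fraction(scores: Sequence[float], visual_idx: Sequence[int], keep_fraction: float) -> List[int]:
    """Return the token indices of the highest-scoring visual tokens, in sequence order."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    # Assumption: round the kept count up so that a nonzero fraction keeps at least one token.
    k = math.ceil(len(scores) * keep_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(visual_idx[i] for i in ranked[:k])


# Toy example: tokens 0-3 are visual, tokens 4-5 are text, token 5 is the last instruction token.
attn = [
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.0] * 6,
    [0.1, 0.6, 0.2, 0.1, 0.0, 0.0],
    [0.5, 0.1, 0.3, 0.1, 0.0, 0.0],
]
visual = [0, 1, 2, 3]
kept_last = retain_top_fraction(last_instr_scores(attn, visual, 5), visual, 0.5)
kept_all = retain_top_fraction(all_text_scores(attn, visual, [4, 5]), visual, 0.5)
```

In the toy example the two strategies keep different token sets, which shows why the choice of scoring query matters when comparing retention rates.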