Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu

TL;DR

This work addresses the latency of Video-LLMs on long video sequences by introducing Sparse-to-Dense (StD), a training-free decoding strategy that uses a sparse top-K attention draft model together with a dense parallel verifier to achieve lossless acceleration. The sparse draft proposes $\gamma$ tokens by selectively caching the most relevant visual KV pairs guided by textual context, while the dense verifier checks these predictions against the full KV cache, ensuring outputs match the original model. Empirical results on LLaVA-OneVision-7B and Qwen2-VL-7B across MLVU and VideoMME show StD delivers up to 1.94$\times$ wall-time speedups with no degradation in output quality, requiring minimal integration (about 20 lines of code) and no additional training. This approach enables real-time video understanding with Video-LLMs and offers a practical, deployment-friendly path toward scalable long-form video inference, with future work exploring KV-cache offloading to CPU memory to further alleviate GPU bottlenecks.
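To make the idea of a sparse top-K draft more concrete, the following is a minimal pure-Python sketch of attention restricted to the K highest-scoring KV pairs. It is an illustration under stated assumptions, not the paper's implementation: the function names (`select_top_k`, `sparse_attention`) are hypothetical, a single query vector stands in for the textual context that guides selection, and a single attention head is modeled with plain lists instead of tensors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def select_top_k(query, keys, k):
    """Indices of the k keys with the highest scores, kept in original order.

    Assumption: a single query stands in for the text tokens that would
    guide the selection of visual KV pairs.
    """
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, k):
    """Attention computed over only the top-k KV pairs (the draft path)."""
    kept = select_top_k(query, keys, k)
    return attention(query, [keys[i] for i in kept], [values[i] for i in kept])
```

When K equals the number of cached KV pairs, `sparse_attention` reduces to dense attention; with a smaller K it approximates the dense output only to the extent that the attention mass is concentrated on a few keys, which is why a dense verifier is still needed for lossless output.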

Abstract

Due to the auto-regressive nature of current video large language models (Video-LLMs), inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences, which are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ wall-time speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
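The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration assuming greedy decoding, not the authors' code: `draft_next` and `dense_next` are hypothetical callables that map a token sequence to the next token under the sparse and dense models respectively, and all function names are invented for this example. In a real system the dense model would score all drafted positions in one parallel forward pass; here the checks run one by one for clarity.

```python
def draft_tokens(draft_next, prefix, gamma):
    """Autoregressively propose gamma tokens with the cheap (sparse) model."""
    proposed = []
    for _ in range(gamma):
        proposed.append(draft_next(prefix + proposed))
    return proposed


def verify(dense_next, prefix, proposed):
    """Accept drafted tokens until the first disagreement with the dense model.

    The first mismatching draft is replaced by the dense model's token. If
    every draft matches, one extra dense token is appended, so each call
    yields at least one token.
    """
    accepted = []
    for token in proposed:
        target = dense_next(prefix + accepted)
        if target != token:
            accepted.append(target)
            return accepted
        accepted.append(token)
    accepted.append(dense_next(prefix + accepted))
    return accepted


def std_generate(dense_next, draft_next, prompt, max_new_tokens, gamma):
    """Greedy generation that alternates sparse drafting and dense verification."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft_tokens(draft_next, out, gamma)
        out.extend(verify(dense_next, out, proposed))
    return out[len(prompt):][:max_new_tokens]


def dense_generate(dense_next, prompt, max_new_tokens):
    """Reference: plain greedy decoding with the dense model only."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(dense_next(out))
    return out[len(prompt):]
```

Because every emitted token is either confirmed or supplied by the dense model, `std_generate` returns the same sequence as `dense_generate` under greedy decoding, however often the draft is wrong. The speedup depends on how many drafted tokens are accepted per verification step, which is the trade-off governed by $K$ and $\gamma$.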

Paper Structure

This paper contains 19 sections, 2 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Effect of $K$ and $\gamma$ on MLVU using LLaVA-OneVision-7B.