BlindSight: Harnessing Sparsity for Efficient Vision-Language Models

Tharun Adithya Srikrishnan, Deval Shah, Timothy Hein, Ahmed Hassan, Stephen Youn, Steven K. Reinhardt

TL;DR

The paper tackles the time-to-first-token (TTFT) bottleneck in multi-image vision-language models (VLMs) by exploiting the sparsity inherent in inter-image attention. It introduces BlindSight, a two-stage offline framework that first characterizes the sparsity of each attention head per prompt and then aggregates those patterns across a dataset into prompt-independent masks. A Triton-based sparse attention kernel applies the masks at inference time, at a cost of about 0.78% absolute accuracy on average across multi-image comprehension benchmarks. BlindSight speeds up the attention computation by $1.8$–$3.2\times$ for prompt lengths from $36{,}000$ to $300{,}000$ tokens, generalizes across Qwen2-VL, Qwen2.5-VL, and Gemma 3, and remains compatible with token-pruning methods. The work also analyzes where the sparsity comes from, shows that image-delimiter tokens play a key role in enabling it, and argues that future VLMs should mix sparse and dense attention layers for better efficiency and scalability.

Abstract

Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention computation (prompt length 36K-300K). BlindSight generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks. Finally, we advocate for the design of efficient VLMs that combine BlindSight-inspired sparse and dense layers.
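The four head categories named in the abstract can be pictured as boolean attention masks. The sketch below is one plausible reading of the category names, not the paper's kernel or its exact mask definition. It assumes that sparsity applies only between image tokens (any pair involving a text token stays under ordinary causal attention) and that the first token of each image span stands in for an image delimiter acting as the sink. The function names and the `segments` labelling scheme are hypothetical.

```python
from typing import List

CATEGORIES = ("dense", "sink", "intra_image", "intra_image_sink")


def build_mask(segments: List[str], category: str) -> List[List[bool]]:
    """Return mask[q][k] == True where query q may attend to key k.

    segments[i] labels token i as "text" or with an image id such as "img0".
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    n = len(segments)
    # Assumption: the first token of each image span plays the role of the
    # image delimiter and is treated as that image's sink token.
    is_sink = [
        seg != "text" and (i == 0 or segments[i - 1] != seg)
        for i, seg in enumerate(segments)
    ]
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only keys at or before the query
            if segments[q] == "text" or segments[k] == "text":
                mask[q][k] = True  # assumption: text pairs are never pruned
                continue
            same_image = segments[q] == segments[k]
            if category == "dense":
                mask[q][k] = True
            elif category == "sink":
                mask[q][k] = is_sink[k]
            elif category == "intra_image":
                mask[q][k] = same_image
            else:  # "intra_image_sink"
                mask[q][k] = same_image or is_sink[k]
    return mask


def density(mask: List[List[bool]]) -> float:
    """Fraction of causal query-key pairs that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)


if __name__ == "__main__":
    # Two text tokens followed by two images of three tokens each.
    layout = ["text"] * 2 + ["img0"] * 3 + ["img1"] * 3
    for name in CATEGORIES:
        print(name, round(density(build_mask(layout, name)), 3))
```

A real kernel would skip whole masked blocks rather than materialize a per-token mask; the dense boolean matrix here is only meant to make the four patterns concrete.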

Paper Structure

This paper contains 40 sections, 1 equation, 18 figures, 10 tables, 7 algorithms.

Figures (18)

  • Figure 1: (a) Attention matrix for an attention head in Qwen2.5-VL (7B). The input prompt consists of text followed by 4 images. Notice that the image tokens vastly outnumber text tokens. (b) Impact of number of input images on intra/inter-image attention FLOPs.
  • Figure 2: Sparse attention categories in VLMs for prompts with multiple images
  • Figure 3: Distribution of sparsity categories across VLMs
  • Figure 4: BlindSight Kernel: Performance improvements in Qwen2.5-VL (7B)
  • Figure 5: Impact of removing image boundary tokens: Removing <image_start> and <image_end> impairs attention sinks and disrupts the intra-image masking pattern.
  • ...and 13 more figures
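
The trend described in the caption of Figure 1(b) can be approximated with a simple count of query-key pairs. The sketch below is a back-of-the-envelope model, not the paper's measurement. It assumes causal attention, equally sized images, attention FLOPs proportional to the number of attended pairs, and it ignores text tokens. Both function names are hypothetical.

```python
from typing import Tuple


def image_pair_counts(num_images: int, tokens_per_image: int) -> Tuple[int, int]:
    """Count causal query-key pairs among image tokens.

    Returns (intra, inter): pairs within the same image and pairs that
    cross from a later image to an earlier one.
    """
    n, t = num_images, tokens_per_image
    intra = n * t * (t + 1) // 2      # one causal triangle per image
    inter = t * t * n * (n - 1) // 2  # one full t-by-t block per ordered image pair
    return intra, inter


def inter_image_fraction(num_images: int, tokens_per_image: int) -> float:
    """Share of image-to-image attention pairs that cross image boundaries."""
    intra, inter = image_pair_counts(num_images, tokens_per_image)
    return inter / (intra + inter)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(inter_image_fraction(n, 1000), 3))
```

Under these assumptions the inter-image share approaches (N - 1) / N for N images with many tokens each, so the fraction of work that an intra-image mask can skip grows with the number of images.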