VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos

Abstract

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient text-to-image cross-attention, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network across a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
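
To make the efficiency argument concrete, the sketch below counts query-key pairs in the attention score matrix under three hypothetical layer types: full self-attention over image and text tokens, cross-attention in which only text tokens act as queries, and text-only attention. The function names, the layer mix, and the token counts are illustrative assumptions, not values from the paper. The count ignores causal masking, projections, and feed-forward cost, so it is a rough proxy rather than a FLOPs figure.

```python
def attention_pairs(n_img, n_txt, mode):
    """Query-key pairs scored in one attention layer (hypothetical accounting)."""
    if mode == "full":
        # Every image and text token attends to every image and text token.
        return (n_img + n_txt) ** 2
    if mode == "cross":
        # Only text tokens are queries; keys are image and text tokens.
        return n_txt * (n_img + n_txt)
    if mode == "text_only":
        # Image tokens take no part in this layer.
        return n_txt ** 2
    raise ValueError(f"unknown mode: {mode}")


def stack_pairs(n_img, n_txt, layer_modes):
    """Total query-key pairs over a stack of layers."""
    return sum(attention_pairs(n_img, n_txt, m) for m in layer_modes)


# Assumed sizes: 2000 image tokens, 100 text tokens, 32 layers.
n_img, n_txt = 2000, 100
dense = stack_pairs(n_img, n_txt, ["full"] * 32)
# Assumed sparse mix: 2 self-attention, 6 cross-attention, 24 text-only layers.
sparse = stack_pairs(n_img, n_txt, ["full"] * 2 + ["cross"] * 6 + ["text_only"] * 24)
ratio = dense / sparse
```

Under this accounting the image tokens are never discarded: the saving comes entirely from how many layers touch them, and from whether they act as queries in those layers.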

Paper Structure

This paper contains 30 sections, 4 equations, 10 figures, and 12 tables.

Figures (10)

  • Figure 1: Efficiency comparison: FLOPs reduction vs. accuracy. Our approach is significantly more efficient while retaining performance on the harder datasets. See the motivation and experimental setup sections for the definition of "easy" and "hard".
  • Figure 2: Cross-modality attention patterns across layers. We plot the proportion of attention scores allocated to three interaction types: text queries attending to image tokens (Query-to-Image), answer tokens attending to image tokens (Answer-to-Image), and answer tokens attending to query tokens (Answer-to-Query). For easy tasks like SQA, interaction is sparse and dominated by text-to-text attention. For hard tasks like DocVQA, the model attends to the image across the whole network.
  • Figure 3: Evolution of visual representations across layers, measured by pairwise CKA similarity. For easy tasks (e.g., SQA), visual features remain largely static (high similarity across layers). For harder tasks (e.g., DocVQA), features are progressively refined.
  • Figure 4: Accuracy sensitivity to dropping all vision tokens in different subsets of LLM layers. Left: Accuracy distribution on a dataset-by-dataset basis. Certain datasets (e.g., DocVQA, ChartQA) are particularly sensitive to reduced vision-language interactions. Right: Correlation between layer-drop configuration and accuracy across datasets. Two clusters emerge: vision-sensitive ("hard") datasets (e.g., InfoVQA, OCRBench) and coarse-vision ("easy") datasets (e.g., POPE, SQA, GQA).
  • Figure 5: Conceptual architecture of VISOR. Visual information is sparsely injected into the language stream via a few cross-attention and self-attention layers modelling text-image and image-image interactions. Cross-attention efficiently provides visual context to the text tokens without altering the visual representations. Self-attention, while more costly, refines the visual tokens, enabling subsequent cross-attention layers to access higher-level visual features. This design strikes a balance between efficiency and representational power.
  • ...and 5 more figures
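
The pairwise CKA similarity used in Figure 3 can be illustrated with a minimal sketch. It assumes the linear variant of CKA and per-layer feature matrices of shape (tokens, dim); the source does not state which CKA variant or feature layout was used. The function names and the synthetic features are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows."""
    # Center each feature dimension across tokens.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def pairwise_cka(layer_feats):
    """Symmetric matrix of linear CKA scores between all pairs of layers."""
    n = len(layer_feats)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(layer_feats[i], layer_feats[j])
    return sim


# Synthetic stand-in for visual features at 4 layers (64 tokens, 16 dims):
# each layer adds noise to the previous one, mimicking gradual refinement.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(64, 16))]
for _ in range(3):
    feats.append(feats[-1] + 0.5 * rng.normal(size=(64, 16)))
sim = pairwise_cka(feats)
```

Entries of `sim` close to 1 correspond to layers whose visual features are nearly unchanged, the pattern the caption reports for easy tasks. Lower off-diagonal values indicate progressive refinement.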