FlexAttention for Efficient High-Resolution Vision-Language Models

Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan

TL;DR

High-resolution vision-language models incur quadratic attention cost when they process dense high-resolution (HR) tokens. FlexAttention introduces a dynamic high-resolution feature selection mechanism and a hierarchical self-attention module that fuses the selected HR details with low-resolution context, achieving $T = \mathcal{O}((M+N)ND)$ complexity versus the naive $T_{original} = \mathcal{O}((M+N)^2D)$. Evaluations on V* Bench, MagnifierBench, TextVQA, and RSVQA-HRBEN show that FlexAttention improves accuracy while reducing TFLOPs by about 40% relative to high-resolution baselines, with competitive performance against GPT-4V on some metrics. The approach is plug-and-play for existing VLMs and could extend to other long-sequence modalities such as video or audio, making high-resolution multimodal reasoning more practical to deploy.
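
As a quick check on the two complexity expressions above, dividing one by the other gives the relative cost of the attention term. The reading of the symbols is an assumption, since this summary does not define them: $N$ is taken as the number of low-resolution image and text tokens, $M$ as the number of selected high-resolution tokens, and $D$ as the hidden dimension.

$$\frac{T}{T_{original}} = \frac{(M+N)ND}{(M+N)^2 D} = \frac{N}{M+N}$$

Under that reading, the saving grows with the share of high-resolution tokens in the sequence; a hypothetical $M = N$, for instance, would halve the attention term. The ratio covers only this term, so it is not expected to match the end-to-end TFLOPs reduction of about 40% quoted above.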

Abstract

Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively use all of these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded as both high-resolution tokens and low-resolution tokens, and only the low-resolution tokens and a few selected high-resolution tokens are used to calculate the attention map, which greatly reduces the computational cost. The high-resolution tokens are selected by a high-resolution selection module that retrieves tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated with the low-resolution tokens and text tokens and fed into a hierarchical self-attention layer, which produces an attention map that can be used for the next step of high-resolution token selection. The hierarchical self-attention and high-resolution token selection are performed iteratively for each attention layer. Experiments on multimodal benchmarks show that FlexAttention outperforms existing high-resolution VLMs (e.g., relative improvements of about 9% on V* Bench and about 7% on TextVQA), while also reducing the computational cost by nearly 40%.
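
The select-then-attend step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `select_high_res` and `hierarchical_attention` are hypothetical, learned query/key/value projections and multiple heads are omitted, and the per-token `relevance` score is taken as a given input rather than derived from a previous layer's attention map. The sketch also assumes that queries come only from the hidden states (low-resolution image and text tokens) while keys and values also cover the selected high-resolution tokens, which is one way to obtain a cost that scales as $(M+N)ND$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_high_res(hr_tokens, relevance, m):
    """Keep the m high-resolution tokens with the largest relevance score.

    `relevance` is a hypothetical per-token score, assumed to be given.
    """
    idx = np.argsort(relevance)[-m:]
    return hr_tokens[idx], idx

def hierarchical_attention(hidden, hr_selected):
    """Attend from N hidden states to N + M keys/values (no learned projections)."""
    d = hidden.shape[-1]
    kv = np.concatenate([hidden, hr_selected], axis=0)  # (N + M, D)
    scores = hidden @ kv.T / np.sqrt(d)                 # (N, N + M)
    attn = softmax(scores, axis=-1)
    return attn @ kv, attn                              # (N, D), (N, N + M)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
N, M_total, M, D = 6, 16, 4, 8
hidden = rng.normal(size=(N, D))           # low-resolution image + text tokens
hr_tokens = rng.normal(size=(M_total, D))  # all high-resolution tokens
relevance = rng.random(M_total)            # stand-in for an attention-derived score

hr_sel, idx = select_high_res(hr_tokens, relevance, M)
out, attn = hierarchical_attention(hidden, hr_sel)
print(out.shape, attn.shape)  # (6, 8) (6, 10)
```

The score matrix has shape $N \times (N+M)$ rather than $(N+M) \times (N+M)$, which is where the saving over full attention comes from in this sketch. In a multi-layer model the returned `attn` would be the natural source for the next layer's relevance scores, but mapping it back onto the high-resolution grid is left out here.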

Paper Structure

This paper contains 19 sections, 6 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: An overview of VLMs processing high-resolution images for the VQA task. (a) A low-resolution VLM first downsamples the high-resolution image to meet its vision encoder's input requirement; the detail lost in downsampling makes it hard to answer the question correctly. (b) A high-resolution VLM takes the high-resolution image as input, at the cost of a large number of high-resolution image tokens and therefore excessive computation. (c) Equipped with FlexAttention, the model encodes the whole high-resolution image and dynamically selects the small portion of high-resolution features it is attending to during generation, avoiding the high computational cost.
  • Figure 2: An overview of FlexAttention. Within each FlexAttention layer, the encoded high-resolution image features are selected according to the input attention map. The selected features are then fed into the hierarchical self-attention module together with the input hidden states, which comprise both low-resolution image tokens and text tokens.
  • Figure 3: Illustration of the high-resolution feature selection module.
  • Figure 4: Ablation studies of selection strategies (left) and image sizes (right).