HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models

Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji

TL;DR

High-resolution Vision-Language Models incur a token explosion due to dynamic image partitioning, impeding inference on commodity GPUs. HiRED presents an attention-guided two-phase token-dropping framework that allocates a per-partition token budget via CLS-attention in ViT and retains the most informative tokens before LLM processing. Across multiple VLMs and eight benchmarks, HiRED with a 20% budget achieves substantial gains in throughput, latency, and memory while maintaining competitive accuracy, demonstrating the practicality of attention-guided token management for efficient high-resolution multimodal inference. This approach promises scalable deployment of high-resolution VLMs in resource-constrained settings and lays groundwork for further optimization, such as preserving spatial structure with enhanced positional encoding.

Abstract

High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate an excessive number of visual tokens because they must encode multiple partitions of a high-resolution input image. Processing so many visual tokens through multiple transformer networks poses significant computational challenges, particularly for resource-constrained commodity GPUs. To address this challenge, we propose High-Resolution Early Dropping (HiRED), a plug-and-play token-dropping method designed to operate within a fixed token budget. HiRED uses the attention of the CLS token in the vision transformer (ViT) to assess the visual content of the image partitions and allocate an optimal token budget to each partition accordingly. The most informative visual tokens from each partition, up to its allocated budget, are then selected and passed to the subsequent Large Language Model (LLM). We show that HiRED achieves superior accuracy and performance compared to existing token-dropping methods. Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7x increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA Tesla P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving the throughput and latency benefits. Code: https://github.com/hasanar1f/HiRED
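
The two phases described in the abstract (allocate a per-partition budget, then keep the top tokens within it) can be illustrated with a small sketch. This is not the authors' implementation. The function names, the proportional allocation rule, and the rounding scheme are assumptions made for illustration. The per-partition `content_scores` and per-token `importances` are taken as inputs here; in HiRED they are derived from CLS attention in the ViT, and that derivation is not reproduced.

```python
import math


def allocate_budgets(content_scores, sizes, total_budget):
    """Split total_budget across partitions in proportion to content_scores.

    Assumption: scores are positive, and proportional allocation stands in
    for whatever exact rule the paper uses.
    """
    total_score = sum(content_scores)
    raw = [total_budget * s / total_score for s in content_scores]
    # Round down first, never exceeding the number of tokens in a partition.
    budgets = [min(math.floor(r), n) for r, n in zip(raw, sizes)]
    # Hand out the remaining tokens, largest fractional share first.
    order = sorted(range(len(raw)),
                   key=lambda p: raw[p] - math.floor(raw[p]),
                   reverse=True)
    leftover = min(total_budget, sum(sizes)) - sum(budgets)
    while leftover > 0:
        for p in order:
            if leftover > 0 and budgets[p] < sizes[p]:
                budgets[p] += 1
                leftover -= 1
    return budgets


def select_tokens(tokens, importance, budget):
    """Keep the `budget` most important tokens, in their original order."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: importance[i],
                    reverse=True)
    keep = sorted(ranked[:budget])
    return [tokens[i] for i in keep]


def drop_tokens(partitions, importances, content_scores, ratio=0.2):
    """Two-phase dropping: allocate budgets, then select within each budget."""
    sizes = [len(p) for p in partitions]
    total_budget = round(ratio * sum(sizes))
    budgets = allocate_budgets(content_scores, sizes, total_budget)
    kept = [select_tokens(p, imp, b)
            for p, imp, b in zip(partitions, importances, budgets)]
    return kept, budgets


# Toy usage: two partitions of five tokens each, keeping 40% overall.
partitions = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
importances = [[0.1, 0.9, 0.2, 0.8, 0.3], [0.5, 0.4, 0.3, 0.2, 0.1]]
kept, budgets = drop_tokens(partitions, importances, [3.0, 1.0], ratio=0.4)
```

In this toy run the partition with the higher content score receives three of the four retained tokens. Kept tokens stay in their original order, a design choice in this sketch so that their relative positions are not shuffled.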

Paper Structure (28 sections, 2 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Inference steps of LLaVA-Next [llava-next-technical] for a high-resolution VLM with dynamic partitioning.
  • Figure 2: The sparse nature of visual tokens is evident during the generation using LLM. (a) Visual tokens receive significantly less attention compared to system and text tokens. (b) The top 20% and 40% of visual tokens account for 60% and 80% of the total attention, respectively.
  • Figure 3: In ViT, the CLS-attention map shows distinct characteristics across layers. The initial layers highlight the subject patches while ignoring the background, aligning mostly with the image content. The final layers, however, highlight informative patches where ViT stores most of the image features.
  • Figure 4: Design of HiRED for high-resolution VLMs to drop visual tokens before LLM. We first allocate token budgets for the full-image and sub-images and then select tokens with top feature importance within the allocated budget.
  • Figure 5: Number of visual tokens generated for 100 samples of TextVQA with Full, PruMerge, PruMerge+, and HiRED.
  • ...and 1 more figures