Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi

TL;DR

This paper proposes Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens, significantly improving computational efficiency for both training and inference.

Abstract

Recent advancements in vision-language models (VLMs) have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images. In widely used fully autoregressive transformer-based models such as LLaVA, projected visual tokens are prepended to textual tokens. The visual tokens often far outnumber the prompt tokens, which increases computational overhead during both training and inference. In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor adds a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers in the language tower of the VLM. After these few layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement and requires only a small number of new trainable parameters, with minimal impact on model performance. In our experiments, with merely 8 visual registers (about 1% of the original tokens), Victor shows less than a 4% accuracy drop while reducing the total training time by 43% and boosting the inference throughput by 3.3×.
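
As a quick check on the "about 1%" figure, the short sketch below computes the fraction of visual tokens that survive for each register budget. It assumes the baseline count of 576 visual tokens stated in the Figure 4 caption; the function name `token_fraction` is illustrative and does not come from the paper.

```python
# Baseline number of visual tokens, taken from the Figure 4 caption.
BASELINE_VISUAL_TOKENS = 576


def token_fraction(num_registers: int, num_visual: int = BASELINE_VISUAL_TOKENS) -> float:
    """Fraction of the original visual tokens kept as registers."""
    if num_visual <= 0:
        raise ValueError("num_visual must be positive")
    return num_registers / num_visual


# Register budgets listed in the Figure 4 caption.
for n in (256, 128, 64, 32, 16, 8):
    print(f"{n:>3} registers -> {token_fraction(n):.1%} of {BASELINE_VISUAL_TOKENS} visual tokens")
```

Under that assumption, 8 registers correspond to 8/576 of the baseline visual tokens, which is consistent with the abstract's rounding to about 1%.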

Paper Structure

This paper contains 29 sections, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: Efficiency-Performance Trade-Off Curve. We compare our proposed method, Victor, with the state-of-the-art method FastV. We report the normalized average score across $12$ benchmarks and the corresponding throughput increase relative to the original baseline model (details in \ref{subsec:eval}). The size of the circles represents the number of visual tokens for each method, with larger circles representing more tokens. Victor establishes a more favorable Pareto frontier than FastV, demonstrating a significantly smaller performance drop as throughput increases.
  • Figure 2: Method Overview. Victor is a simple yet effective method for enhancing the efficiency of vision-language models. The process involves four key steps based on the LLaVA-style model: (I) appending learned visual register tokens after the visual tokens, where the number of visual registers is much smaller than the number of visual tokens, (II) using the first $k$ layers of the language tower to summarize visual information into the visual registers, (III) discarding all visual tokens before layer $k$, and (IV) performing efficient inference from layer $k$ onward using only the visual registers and textual tokens, with a significantly reduced sequence length.
  • Figure 3: Token Similarities.
  • Figure 4: Efficiency-Performance Trade-Off Curve. We measure the relative throughput increase compared to the baseline model. The test covers two main scenarios: generating $2$ tokens and generating $128$ tokens. In both cases, the batch size is set to $16$, and the text prompt length is $64$ tokens. For all methods, we use $256$, $128$, $64$, $32$, $16$, and $8$ visual tokens to generate the line plot, and the number of visual tokens for the baseline method is $576$.
  • Figure 5: Performance vs. FLOPs Reduction.
  • ...and 11 more figures
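
The four steps in the Figure 2 caption can be illustrated with a minimal sketch that tracks only the sequence length seen by each layer. This is not the authors' implementation: tokens are plain string labels rather than hidden-state vectors, `identity_layer` is a hypothetical placeholder for a transformer layer, layers are assumed to be 0-indexed, and the values `num_layers=32` and `k=3` are arbitrary choices for illustration. The token counts (576 visual, 8 registers, 64 text) follow the settings mentioned in the abstract and the Figure 4 caption.

```python
from typing import Callable, List, Tuple

Token = str  # "visual", "register", or "text"; stands in for a hidden-state vector


def build_sequence(num_visual: int, num_registers: int, num_text: int) -> List[Token]:
    """Step I: visual tokens, then the learned registers, then the text tokens."""
    return ["visual"] * num_visual + ["register"] * num_registers + ["text"] * num_text


def identity_layer(seq: List[Token]) -> List[Token]:
    """Hypothetical placeholder for a transformer layer; it keeps the length unchanged."""
    return list(seq)


def run_layers(
    seq: List[Token],
    num_layers: int,
    k: int,
    layer_fn: Callable[[List[Token]], List[Token]] = identity_layer,
) -> Tuple[List[Token], List[int]]:
    """Run all layers and record the sequence length each layer processes."""
    if not 0 <= k <= num_layers:
        raise ValueError("k must lie between 0 and num_layers")
    lengths: List[int] = []
    for layer in range(num_layers):
        if layer == k:
            # Step III: drop every visual token just before layer k (0-indexed assumption).
            seq = [tok for tok in seq if tok != "visual"]
        # Step II (layers < k) sees the full sequence; step IV (layers >= k) sees the short one.
        seq = layer_fn(seq)
        lengths.append(len(seq))
    return seq, lengths


# Illustrative settings: num_layers and k are assumptions, not values from the paper.
final_seq, lengths = run_layers(build_sequence(576, 8, 64), num_layers=32, k=3)
print(lengths[:4], "total token-layer steps:", sum(lengths))
```

In a real model the first $k$ layers would let the registers attend to the visual tokens, so that the information is already summarized before the visual tokens are removed; the sketch only shows where the sequence shrinks and by how much.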