Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters

Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter

TL;DR

The paper addresses the problem of high inference latency in Vision Language Models caused by processing many visual tokens. It introduces inference-time scaling laws, $Y(N,T) = \frac{A}{N^{\alpha}} \cdot \frac{B}{T^{\beta}} + D$, to quantify the trade-off between LLM size $N$ and total tokens $T$ under fixed compute, and validates them across visual reasoning tasks using TokenPacker-based token compression. A key finding is that, for visual reasoning, the compute-optimal regime is to maximize the LLM size within budget while compressing visual inputs to as few tokens as possible (often $V \approx 1$), whereas OCR-like tasks favor retaining more visual tokens. The work further introduces QueCC, a query-based extreme-token-compression method that injects user prompts into the compression process via cross-attention with region-wise downsampling, demonstrating substantial gains at very low token counts. Collectively, these results shift focus toward extreme token compression and tailored compression strategies to enable efficient, high-capacity VLMs with practical real-world impact.
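The trade-off expressed by this functional form can be made explicit with a short derivation. It is a sketch under two simplifying assumptions that are not spelled out in the summary above: text tokens are cached ($Q = 0$), so that $T = V$, and the inference budget is taken to be exactly $C = N \cdot V$ with constant factors absorbed into $C$. The symbol $C$ is introduced here for illustration only.

```latex
% Assumptions: Q = 0 so that T = V, and a fixed budget C = N V.
\[
\begin{aligned}
Y(N, V) &= \frac{A}{N^{\alpha}} \cdot \frac{B}{V^{\beta}} + D \\
        &= \frac{A}{(C/V)^{\alpha}} \cdot \frac{B}{V^{\beta}} + D
           && \text{(substitute } N = C / V \text{)} \\
        &= \frac{A B}{C^{\alpha}} \, V^{\alpha - \beta} + D
\end{aligned}
\]
```

Under these assumptions, and reading $Y$ as an error to be minimized, the sign of $\alpha - \beta$ decides the direction: if $\alpha > \beta$, $Y$ grows with $V$ at fixed $C$, so the fewest visual tokens paired with the largest $N$ is preferred; if $\alpha < \beta$, the preference flips. The case $Q > 0$ is not covered by this sketch.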

Abstract

Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks, driven by incorporating image representations into the token inputs of Large Language Models (LLMs). However, their real-world deployment is often constrained by high latency during inference due to the substantial compute required by the LLM to process the large number of input tokens, predominantly arising from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input tokens needed to represent the image, the latter of which has been the focus of many recent efforts around token compression. However, it is unclear what the optimal trade-off is given a fixed inference budget. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs is achieved by using the largest LLM that fits within the inference budget while minimizing visual token count - often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5-10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take the first steps toward designing token compression algorithms tailored for high-compression settings, utilizing prompt-based compression of tokens. Our work underscores the performance and efficiency benefits of operating in low visual token regimes and the importance of developing tailored token reduction algorithms for such conditions. Code is available at https://github.com/locuslab/llava-token-compression.
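The selection rule described in the abstract (pick the largest LLM that fits the budget, then compare visual token counts) can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: the coefficients `A`, `B`, `D`, `ALPHA`, `BETA`, the list of model sizes, and the budget unit (billions of parameters times tokens) are all hypothetical placeholders, text tokens are assumed cached ($Q = 0$, so $T = V$), and $Y$ is treated as an error to be minimized.

```python
from typing import List, Optional, Tuple

# Hypothetical coefficients for Y = (A / N**alpha) * (B / T**beta) + D.
# Illustrative placeholders only, NOT the paper's fitted values.
A, B, D = 1.0, 1.0, 0.2
ALPHA, BETA = 0.08, 0.02

# Illustrative LLM sizes in billions of parameters (an assumption).
LLM_SIZES_B = [0.5, 1.8, 4.0, 7.0, 14.0]


def predicted_error(n_params: float, n_tokens: float) -> float:
    """Evaluate the scaling-law form for N parameters and T tokens."""
    return (A / n_params ** ALPHA) * (B / n_tokens ** BETA) + D


def largest_llm_within_budget(budget: float, n_tokens: int) -> Optional[float]:
    """Return the largest size N with N * T <= budget, or None if none fits."""
    fitting = [n for n in LLM_SIZES_B if n * n_tokens <= budget]
    return max(fitting) if fitting else None


def best_configuration(
    budget: float, visual_token_counts: List[int]
) -> Tuple[float, int, float]:
    """For each V, take the largest LLM that fits, then pick the lowest error."""
    candidates = []
    for v in visual_token_counts:
        n = largest_llm_within_budget(budget, v)
        if n is not None:
            candidates.append((predicted_error(n, v), n, v))
    if not candidates:
        raise ValueError("no configuration fits the budget")
    err, n, v = min(candidates)
    return n, v, err


if __name__ == "__main__":
    n, v, err = best_configuration(budget=14.0, visual_token_counts=[1, 4, 16, 64])
    print(f"best: N={n}B, V={v}, predicted error={err:.3f}")
```

With these placeholder exponents (`ALPHA > BETA`), the search selects the largest model with a single visual token. Swapping the two exponents would favor more visual tokens instead, so the outcome depends entirely on the fitted values.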

Paper Structure

This paper contains 32 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Inference optimal scaling laws for VLMs. The number of visual tokens ($V$) passed to the LLM after token compression and the LLM parameter count ($N$) directly determine the inference cost of VLMs ($\mathcal{O}(N(Q+V))$), where $Q$ is the number of text input tokens. Since the downstream performance of VLMs is directly affected by both factors, the optimal trade-off for a fixed inference compute is unclear. In this work, we try to answer this question with our scaling laws. Left: We plot the fitted scaling curves, assuming cached text input tokens ($Q=0$). We observe a surprising trend: for visual reasoning tasks, the compute-optimal behavior (dotted black curve) requires using a single visual token with the largest possible language model that can fit under the inference budget. Right: Inference-optimal behavior under $Q=50$ requires a slightly higher number of visual tokens, as the LLM already incurs a fixed cost due to the text tokens.
  • Figure 2: Our scaling laws (fitted on 0.5-7B VLMs) estimate the performance of a 14B VLM with an error margin of less than 2%.
  • Figure 3: Performance trends when shifting input text token count and benchmark family. Left: For visual reasoning tasks, as the number of text tokens increases, the impact of increasing the number of visual tokens $V$, i.e., reducing compression, becomes more apparent. Intuitively, with a large enough number of text tokens, initial increases in visual tokens are only a minor fraction of the overall compute. Right: When the family of tasks shifts from visual reasoning to OCR/text-understanding, the trends shift: visual token count should be prioritized over LLM size.
  • Figure 4: Performance of various combinations of LLM size and visual token count with similar inference compute on two families of tasks. For many visual reasoning tasks, increasing the LLM size by decreasing the number of visual tokens improves performance. However, for text recognition tasks, decreasing the number of visual tokens is detrimental to performance.
  • Figure 5: Inference optimal scaling laws for PruMerge: When replacing the token compression algorithm, the main findings still hold: inference-optimal behavior is still to increase the LLM parameter count by reducing visual tokens in fixed compute scenarios.
  • ...and 2 more figures
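
The intuition in the captions of Figures 1 and 3, that extra visual tokens matter less for compute once text tokens already impose a fixed cost, follows directly from the cost expression $\mathcal{O}(N(Q+V))$. The snippet below is a small sketch of that arithmetic for a fixed LLM size $N$. The function name `relative_cost` and the specific token counts are illustrative choices, not taken from the paper.

```python
def relative_cost(visual_tokens: int, text_tokens: int,
                  baseline_visual_tokens: int = 1) -> float:
    """Ratio of N * (Q + V) to N * (Q + V_baseline) for a fixed LLM size N.

    N cancels in the ratio, so only the token counts matter.
    """
    return (text_tokens + visual_tokens) / (text_tokens + baseline_visual_tokens)


if __name__ == "__main__":
    # Illustrative visual token counts; Q = 0 and Q = 50 mirror Figure 1.
    for q in (0, 50):
        ratios = [round(relative_cost(v, q), 2) for v in (1, 4, 16, 64)]
        print(f"Q={q}: cost relative to V=1 -> {ratios}")
```

Going from 1 to 16 visual tokens multiplies the cost by 16 when $Q=0$, but only by $66/51 \approx 1.29$ when $Q=50$, which is why a few more visual tokens become affordable as the text prompt grows.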