Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters
Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter
TL;DR
The paper addresses the high inference latency of Vision Language Models (VLMs), which stems largely from the many visual tokens the LLM must process. It fits inference-time scaling laws of the form $Y(N,T) = \frac{A}{N^{\alpha}} \cdot \frac{B}{T^{\beta}} + D$ to quantify the trade-off between LLM size $N$ and total input tokens $T$ under a fixed compute budget, and validates them across visual reasoning tasks using TokenPacker-based token compression. The key finding is that, for visual reasoning, the compute-optimal configuration uses the largest LLM that fits within the budget and compresses the visual input to as few tokens as possible (often a visual token count of $V \approx 1$), whereas OCR-like tasks favor retaining more visual tokens. The work further introduces QueCC, a query-based method for extreme token compression that injects the user prompt into the compression process via cross-attention with region-wise downsampling, and reports substantial gains at very low token counts. Together, these results shift the focus toward extreme token compression and toward compression strategies tailored to that regime, enabling efficient, high-capacity VLMs.
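To make the trade-off concrete, the following is a minimal Python sketch of how the scaling-law form above could be used to compare configurations at equal inference cost. It rests on several assumptions that are not taken from the paper: $Y$ is read as a downstream error (lower is better), inference cost is approximated as proportional to $N \cdot T$ with $T$ equal to text tokens plus visual tokens, and every coefficient, model size, and token count is a hypothetical placeholder rather than a fitted value. Under the constraint $N \cdot T = C$, the product $N^{\alpha} T^{\beta}$ equals $C^{\beta} N^{\alpha - \beta}$, so whenever $\alpha > \beta$ the predicted error falls as $N$ grows, which matches the finding stated above.

```python
def predicted_error(n_params, n_tokens, A, B, alpha, beta, D):
    """Scaling-law form Y(N, T) = (A / N**alpha) * (B / T**beta) + D."""
    return (A / n_params**alpha) * (B / n_tokens**beta) + D


def best_config(budget, llm_sizes, n_text_tokens, coeffs):
    """Return (error, llm_size, visual_tokens) with the lowest predicted error.

    Assumes cost = N * T, so each LLM size N can afford T = budget / N total
    tokens, of which T - n_text_tokens are left over for the image.
    """
    candidates = []
    for n in llm_sizes:
        total_tokens = budget / n
        visual_tokens = total_tokens - n_text_tokens
        if visual_tokens < 1:
            continue  # this LLM cannot afford even one visual token
        err = predicted_error(n, total_tokens, **coeffs)
        candidates.append((err, n, visual_tokens))
    return min(candidates)


# Hypothetical coefficients (NOT fitted values); alpha > beta is assumed.
COEFFS = dict(A=1.0, B=1.0, alpha=0.08, beta=0.02, D=0.3)

# Hypothetical LLM sizes (parameters) and text-prompt length (tokens).
LLM_SIZES = [500_000_000, 1_800_000_000, 4_000_000_000, 7_000_000_000]
N_TEXT_TOKENS = 50

# Budget chosen so the largest LLM can afford exactly one visual token.
BUDGET = 7_000_000_000 * (N_TEXT_TOKENS + 1)

best = best_config(BUDGET, LLM_SIZES, N_TEXT_TOKENS, COEFFS)
print(f"error={best[0]:.4f}, llm_size={best[1]}, visual_tokens={best[2]:.1f}")
```

With these placeholder values the sweep selects the largest LLM paired with a single visual token. If the exponents were instead chosen with $\beta > \alpha$, the same sweep would favor smaller LLMs with more tokens, which is one way to picture why tasks that depend heavily on visual detail could land at a different optimum.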
Abstract
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks, driven by incorporating image representations into the token inputs of Large Language Models (LLMs). However, their real-world deployment is often constrained by high latency during inference due to the substantial compute required by the LLM to process the large number of input tokens, predominantly arising from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input tokens needed to represent the image, the latter of which has been the focus of many recent efforts around token compression. However, it is unclear what the optimal trade-off is given a fixed inference budget. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs is achieved by using the largest LLM that fits within the inference budget while minimizing visual token count - often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5-10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take the first steps toward designing token compression algorithms tailored for high-compression settings, utilizing prompt-based compression of tokens. Our work underscores the performance and efficiency benefits of operating in low visual token regimes and the importance of developing tailored token reduction algorithms for such conditions. Code is available at https://github.com/locuslab/llava-token-compression.
