Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck

Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos

TL;DR

The paper tackles the heavy visual token burden in LVLMs by introducing Fwd2Bot, a double-forward bottleneck that compresses vision tokens into a small, task-agnostic set of summary tokens. The LVLM itself learns this compression through two passes: the first pass builds a compressed representation from a prompt and learnable tokens, and the second pass optimizes generation with an autoregressive objective, while a contrastive loss improves discriminative quality. Stage-specific LoRA adapters and the combined autoregressive plus contrastive objective let the compressed tokens support both generation and image-text retrieval, achieving a 2× higher compression rate on generative tasks without compromising generative capabilities, and state-of-the-art results on image retrieval and compositionality benchmarks. The approach enables offline pre-indexing of images and demonstrates strong cross-task performance, suggesting a practical path to efficient, unified LVLM deployment.

Abstract

In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously (a) suitable for generative tasks, (b) suitable for discriminative tasks, (c) nearly lossless, and (d) storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot is a "double-forward pass" training strategy: during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, which are used as a direct replacement for the image tokens. The training signal is provided by two losses: an autoregressive loss, applied after the second pass, that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot produces highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2× higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state of the art on image retrieval and compositionality.
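To make the data flow of the double-forward pass concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `toy_llm` is a hypothetical stand-in for the shared LLM (a causal running mean instead of a transformer), the mean pooling of summary tokens and the symmetric InfoNCE form of the contrastive loss are assumptions, and the temperature value is illustrative. The token counts (576 visual tokens, 32 summary tokens) follow the Figure 4 caption; the prompt and instruction lengths are made up.

```python
import math
import random


def toy_llm(seq):
    """Hypothetical stand-in for the shared LLM: a causal running mean,
    so position i only depends on positions <= i."""
    out, acc = [], [0.0] * len(seq[0])
    for i, tok in enumerate(seq, start=1):
        acc = [a + t for a, t in zip(acc, tok)]
        out.append([a / i for a in acc])
    return out


def first_pass(visual_tokens, prompt_tokens, summary_queries):
    """Pass 1: feed [visual; prompt; learnable queries] and keep only the
    outputs at the query positions. These are the summary tokens."""
    hidden = toy_llm(visual_tokens + prompt_tokens + summary_queries)
    return hidden[-len(summary_queries):]


def second_pass(summary_tokens, instruction_tokens):
    """Pass 2: the summary tokens replace the image tokens in front of the
    instruction; the returned states would feed the next-token loss."""
    hidden = toy_llm(summary_tokens + instruction_tokens)
    return hidden[len(summary_tokens):]


def mean_pool(tokens):
    """Average a list of token vectors into one embedding (assumed pooling)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature):
    """Cross-entropy over similarities, where keys[i] is the positive for queries[i]."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_z - logits[i]
    return loss / len(queries)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image InfoNCE (assumed form)."""
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    return 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))


def rand_tokens(n, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
dim = 8
visual = rand_tokens(576, dim, rng)      # image tokens from the vision encoder
prompt = rand_tokens(4, dim, rng)        # compression prompt (length is made up)
queries = rand_tokens(32, dim, rng)      # learnable summary queries
instruction = rand_tokens(6, dim, rng)   # language instruction (length is made up)

summary = first_pass(visual, prompt, queries)
answer_states = second_pass(summary, instruction)
image_embedding = mean_pool(summary)     # what the contrastive loss would consume
print(len(visual), "->", len(summary))
```

In this sketch only the 32 summary tokens cross from the first pass to the second, which is what makes offline pre-indexing possible: the summary tokens can be computed and stored once per image, then reused for any later instruction or retrieval query.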

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Different methods/paradigms for reducing/compressing the visual tokens in LVLMs.
  • Figure 2: Fwd2Bot training pipeline: A first forward pass from the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of visual summary tokens. Then, using the same LLM (weights of depicted LLMs are shared), a second forward pass processes the language instruction(s) alongside the summary visual tokens for training with a next-token prediction loss $\mathcal{L}_{\mathrm{AR}}$ (see Sec. \ref{ssec:method-double-fws}). Furthermore, a contrastive loss $\mathcal{L}_{\mathrm{disc}}$, applied after the first pass, is utilized to further boost the representation strength, especially for discriminative tasks (see Sec. \ref{ssec:method-disctiminative-adaptation}). Trainable components are marked in the figure.
  • Figure 3: The norm of the learned LoRA weight adjustment $\Delta W = B A$ for a model trained with either a single LoRA or stage-specific LoRAs.
  • Figure 4: Visualization of attention weights assigned to the 576 visual tokens and the 32 compressed tokens. On the left, we show the cumulative weights that the generated tokens assign to each visual token for the baseline LLaVA-1.5-7B model. On the right, for Fwd2Bot, we first show the per-visual-token weights assigned by the summary tokens during the first forward pass, and then the weights that the generated tokens assign to the compressed tokens during the second forward pass.
  • Figure 5: Captioning with variable number of summary tokens.
  • ...and 2 more figures
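
The quantity in the Figure 3 caption, the norm of $\Delta W = B A$, can be computed as in the sketch below. The caption does not say which norm is plotted, so the Frobenius norm is an assumption here, as are the helper names and the optional `scale` argument (standard LoRA multiplies the update by alpha / r; the default of 1.0 matches the caption's plain $B A$).

```python
import math


def matmul(B, A):
    """Multiply B (d_out x r) by A (r x d_in), both given as lists of rows."""
    inner, cols = len(A), len(A[0])
    return [[sum(row[k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for row in B]


def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))


def lora_delta_norm(B, A, scale=1.0):
    """Norm of the LoRA weight adjustment scale * (B @ A)."""
    return scale * frobenius_norm(matmul(B, A))


# Rank-1 example: B is 2 x 1, A is 1 x 2, so B A = [[3, 4], [6, 8]].
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
print(lora_delta_norm(B, A))
```

Comparing this value per layer between a single shared adapter and separate first-pass and second-pass adapters is one way to reproduce the kind of comparison the caption describes.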