Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Adrian Bulat, Yassine Ouali, Georgios Tzimiropoulos
TL;DR
The paper tackles the heavy visual token burden in Large Vision Language Models (LVLMs) by introducing Fwd2Bot, a double-forward bottleneck that compresses vision tokens into a small, task-agnostic set of summary tokens. The LVLM itself learns this compression through two forward passes: the first builds a compressed representation via a prompt and learnable tokens, and the second consumes that representation in place of the image tokens and is trained with an autoregressive objective. A contrastive loss applied after the first pass improves discriminative quality. Stage-specific LoRA adapters and the combined autoregressive plus contrastive objective let the compressed tokens support both generation and image-text retrieval, yielding a 2× higher compression rate for generative tasks without compromising generative capabilities, and state-of-the-art results on discriminative benchmarks. Because the compression is task-agnostic, images can be pre-indexed offline. The approach also shows strong cross-task performance, suggesting a practical path to efficient, unified LVLM deployment.
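To make the offline pre-indexing idea concrete, here is a minimal sketch in plain Python. It assumes the summary tokens for each image have already been computed by the first forward pass. The mean pooling, the cosine similarity ranking, and the names `mean_pool`, `build_index`, and `retrieve` are illustrative assumptions, not the authors' implementation, and the two-dimensional vectors are toy data.

```python
import math

def mean_pool(tokens):
    """Average a list of token vectors into one embedding (assumed pooling)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_index(summary_tokens_by_image):
    """Offline step: store one pooled embedding per image id."""
    return {
        image_id: mean_pool(tokens)
        for image_id, tokens in summary_tokens_by_image.items()
    }

def retrieve(index, query_embedding, top_k=1):
    """Online step: rank stored images by similarity to a text query embedding."""
    ranked = sorted(
        index,
        key=lambda image_id: cosine(index[image_id], query_embedding),
        reverse=True,
    )
    return ranked[:top_k]

# Toy summary tokens (hypothetical values standing in for first-pass outputs).
summary_tokens = {
    "cat.jpg": [[0.9, 0.1], [1.0, 0.0]],
    "dog.jpg": [[0.1, 0.9], [0.0, 1.0]],
}
index = build_index(summary_tokens)
best_match = retrieve(index, [1.0, 0.2])
```

Because the index stores only the compressed representation, the image never has to be re-encoded at query time; the same stored tokens could also be fed to the second pass for generation.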
Abstract
In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously (a) suitable for generative tasks, (b) suitable for discriminative tasks, (c) nearly lossless, and (d) storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2Bot is a "double-forward pass" training strategy: during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, which serve as a direct replacement for the image tokens. The training signal is provided by two losses: an autoregressive loss, applied after the second pass, that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method with an in-depth ablation study. Overall, Fwd2Bot produces highly informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2× higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state of the art on image retrieval and compositionality.
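The two-loss training signal described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the abstract does not give the exact form of the contrastive loss, so a symmetric InfoNCE loss is used here as a common choice, and the `temperature` and `contrastive_weight` parameters are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def autoregressive_loss(logits, targets):
    """Mean next-token cross-entropy (stands in for the second-pass loss)."""
    total = sum(log_softmax(step)[t] for step, t in zip(logits, targets))
    return -total / len(targets)

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over matched image/text pairs (assumed form)."""
    img = [normalize(v) for v in image_embs]
    txt = [normalize(v) for v in text_embs]
    # sims[i][j]: scaled cosine similarity between image i and text j.
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
            for i in img]
    n = len(sims)
    image_to_text = -sum(log_softmax(sims[k])[k] for k in range(n)) / n
    columns = [[sims[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def total_loss(ar_loss, con_loss, contrastive_weight=1.0):
    """Weighted sum of the two objectives (the weight is a hypothetical knob)."""
    return ar_loss + contrastive_weight * con_loss

# Toy usage: uniform logits over a 2-token vocabulary, and two matched pairs.
ar = autoregressive_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1])
con = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
loss = total_loss(ar, con)
```

In an actual training loop the image embeddings would be derived from the first-pass summary tokens and the logits from the second pass, so gradients from both terms flow back into the same bottleneck.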
