Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin

TL;DR

This work tackles the inefficiency of joint vision–language encoders for large-scale reranking by precomputing vision features offline and compressing them with a lightweight adapter, enabling fast online inference with a compact joint encoder. EDJE preserves competitive retrieval performance while drastically reducing storage and compute, achieving high throughput (up to 50k image–text pairs per second) and minimal per-image storage (as low as around 1 kB in the compressed variant). The approach is modular, works with multiple vision backbones, and is trained with a discriminative objective that combines ITM, MLM, and text-embedding recovery, plus distillation from a full adapter to the compressed version. This provides a practical pathway to deploy vision–language rerankers at web scale, with broad implications for large-scale multimodal retrieval systems and reranking pipelines.

Abstract

Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
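
The offline/online split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `build_token_store`, `rerank`, `toy_encode`, and `toy_joint_score` are hypothetical names, and the two toy functions are trivial stand-ins for the vision encoder plus adapter and for the compact joint encoder. The only point shown is that image-side work happens once, offline, and the online step scores a candidate pool against stored tokens.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def build_token_store(
    images: Dict[str, Any],
    encode_and_compress: Callable[[Any], Any],
) -> Dict[str, Any]:
    """Offline stage: run the (expensive) image-side model once per image."""
    return {image_id: encode_and_compress(image) for image_id, image in images.items()}


def rerank(
    query_text: str,
    candidate_ids: Sequence[str],
    token_store: Dict[str, Any],
    joint_score: Callable[[Any, str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Online stage: score each candidate's stored tokens against the query text."""
    scored = [
        (image_id, joint_score(token_store[image_id], query_text))
        for image_id in candidate_ids
    ]
    # Highest score first; Python's sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_encode(image_tags: List[str]) -> List[str]:
    # Stand-in for "vision encoder + adapter": here an image is just a tag list.
    return list(image_tags)


def toy_joint_score(tokens: List[str], text: str) -> float:
    # Stand-in for the joint encoder: count query words found among the tokens.
    return float(len(set(tokens) & set(text.lower().split())))


images = {
    "img_a": ["dog", "park", "ball"],
    "img_b": ["cat", "sofa"],
    "img_c": ["dog", "beach"],
}
store = build_token_store(images, toy_encode)

# In a real system the candidate pool would come from a fast embedding search.
pool = ["img_a", "img_b", "img_c"]
ranked = rerank("a dog on the beach", pool, store, toy_joint_score, top_k=2)
print(ranked)
```

Because the store is keyed by image ID, the online cost scales with the size of the reranking pool and the number of stored tokens per image, not with the cost of the vision encoder.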

Paper Structure

This paper contains 20 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Inference efficiency and retrieval performance. (a) For methods with strong discriminative capabilities, inference time is dominated by costly ViT feature extraction, prohibiting their practical use for reranking. (b) EDJE achieves competitive zero-shot retrieval performance with up to $53\times$ faster inference. Its token compression makes storing visual features practical, enabling large-scale retrieval.
  • Figure 2: Taxonomy of vision–language joint encoders. Left: Cross-attention models (ALBEF, BLIP, BLIP-2) integrate modalities through cross-attention layers interleaved with textual self-attention. Middle: Joint foundation models such as BEiT-3 employ unified self-attention over native visual and textual tokens, enabling full cross-modal interaction. Right: Modern generative VLMs such as LLaVA combine a pretrained vision encoder with a large language model, tuning the latter to process projected vision tokens as if they originated from text.
  • Figure 3: EDJE architecture overview and adapter. (a) Offline stage (left): images are encoded by the vision encoder and projected by the adapter into a compact set of tokens compatible with the language model. Online stage (right): the small language model consumes the compressed tokens together with the text. (b) Token-compression adapter: cross-attention uses $k$ universal query tokens that act as feature extractors over the visual tokens. An MLP then projects the extracted features into the embedding space of the language model.
  • Figure 4: Retrieval performance vs. number of tokens. Flickr image retrieval for varying token counts, illustrating the compression–performance tradeoff.
  • Figure 5: Retrieval performance vs. reranking pool size. Robustness of local and 64-token variants under different pool sizes on Flickr.
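
The token-compression adapter described in the Figure 3 caption (query tokens cross-attending to visual tokens, followed by an MLP) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single attention head, a ReLU two-layer MLP, no normalization layers, and random weights, and every name and dimension below (`compress_tokens`, `w_k`, `d_lm`, and so on) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def compress_tokens(visual_tokens, queries, w_k, w_v, w_1, w_2):
    """Pool n visual tokens into k tokens, then project them with an MLP.

    visual_tokens: n x d_v    queries: k x d_a (shared across all images)
    w_k, w_v: d_v x d_a       w_1: d_a x d_h       w_2: d_h x d_lm
    Returns (k x d_lm compressed tokens, k x n attention weights).
    """
    keys = matmul(visual_tokens, w_k)      # n x d_a
    values = matmul(visual_tokens, w_v)    # n x d_a
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]          # d_a x n
    scores = matmul(queries, keys_t)                    # k x n
    attn = [softmax([s * scale for s in row]) for row in scores]
    pooled = matmul(attn, values)                       # k x d_a
    hidden = [[max(0.0, h) for h in row] for row in matmul(pooled, w_1)]  # ReLU
    return matmul(hidden, w_2), attn


# Illustrative sizes only; these are assumptions, not values from the paper.
rng = random.Random(0)
n, d_v, d_a, d_h, d_lm, k = 16, 8, 8, 16, 6, 4

visual_tokens = random_matrix(n, d_v, rng, scale=1.0)  # stand-in for ViT outputs
queries = random_matrix(k, d_a, rng)                   # learned in a real model
w_k = random_matrix(d_v, d_a, rng)
w_v = random_matrix(d_v, d_a, rng)
w_1 = random_matrix(d_a, d_h, rng)
w_2 = random_matrix(d_h, d_lm, rng)

compressed, attn = compress_tokens(visual_tokens, queries, w_k, w_v, w_1, w_2)
print(len(compressed), len(compressed[0]))
```

The output always has $k$ rows regardless of how many visual tokens go in, which is what makes per-image storage depend on $k$ and the language-model embedding width rather than on the vision encoder's token count.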