Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

Mitchell Keren Taraday; Shahaf Wagner; Chaim Baskin

TL;DR

This work tackles the inefficiency of joint vision–language encoders for large-scale reranking by precomputing vision features offline and compressing them with a lightweight adapter, so that online inference runs only a compact joint encoder. EDJE preserves competitive retrieval performance while drastically reducing storage and compute, reaching high throughput (up to 50k image–text pairs per second) and small per-image storage (as low as about 1 kB in the compressed variant). The approach is modular, works with multiple vision backbones, and is trained with a discriminative objective that combines image–text matching (ITM), masked language modeling (MLM), and text-embedding recovery, plus distillation from the full adapter to the compressed version. This provides a practical path to deploying vision–language rerankers at web scale in large multimodal retrieval and reranking pipelines.
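To make the token-compression idea concrete, here is a minimal sketch of one way an attention-based adapter could shrink a long sequence of precomputed vision tokens into a few compact ones. It assumes cross-attention pooling with a small set of learned queries and, for brevity, omits the key/value projections and multi-head structure a real adapter would likely have. The function names, toy sizes, and random inputs are all hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(tokens, queries):
    """Cross-attention pooling (assumed design, not the paper's exact adapter).

    tokens:  n x d list of precomputed vision tokens
    queries: m x d list of learned query vectors, with m much smaller than n
    returns: m x d list of compressed tokens
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    compressed = []
    for q in queries:
        # Scaled dot-product score between this query and every vision token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        # Each output token is a convex combination of the input tokens.
        pooled = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
        compressed.append(pooled)
    return compressed


random.seed(0)
n_tokens, n_queries, dim = 16, 4, 8  # toy sizes, chosen only for illustration
vision_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
learned_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]

compressed = compress_tokens(vision_tokens, learned_queries)
print(len(compressed), len(compressed[0]))
```

In this sketch only the `n_queries` output tokens would need to be stored per image, which is where the storage saving comes from; the online joint encoder would then attend over these few tokens together with the text.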

Abstract

Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs per second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
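As a rough illustration of what the figures in the abstract imply at scale, the following back-of-the-envelope sketch converts per-image storage and throughput into an index size and a per-query reranking latency. The 50k pairs per second and 49 kB per image come from the abstract; the collection size, the number of reranked candidates, the decimal interpretation of kB (1000 bytes), and the assumption that one query's candidates are scored at the full quoted throughput are all hypothetical.

```python
def index_size_tb(num_images, kb_per_image):
    """Total storage in terabytes, assuming 1 kB = 1000 bytes."""
    return num_images * kb_per_image * 1e3 / 1e12


def rerank_latency_ms(num_candidates, pairs_per_second):
    """Time in milliseconds to score one query against its candidates."""
    return 1e3 * num_candidates / pairs_per_second


PAIRS_PER_SECOND = 50_000   # throughput quoted in the abstract
KB_PER_IMAGE = 49           # per-image storage quoted in the abstract
NUM_IMAGES = 1_000_000_000  # hypothetical web-scale collection
TOP_K = 1_000               # hypothetical candidates passed to the reranker

size_tb = index_size_tb(NUM_IMAGES, KB_PER_IMAGE)
latency_ms = rerank_latency_ms(TOP_K, PAIRS_PER_SECOND)
print(f"index size: {size_tb:.1f} TB, rerank latency: {latency_ms:.1f} ms")
```

Under these assumptions, a billion-image index would take about 49 TB, and reranking the top 1,000 candidates for a query would take about 20 ms.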
