ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman
TL;DR
ELIP introduces a lightweight text-guided visual prompting mechanism: a 3-layer MLP mapping network turns the text query into a set of prompt vectors, which are integrated into the ViT image encoder's input as $[t_p^1, \dots, t_p^m, t_{CLS}]$ so that the image encoding is conditioned on the query. Retrieval is treated as a two-stage process, in which an initial ranking is followed by query-aware re-ranking of the top-$k$ candidates. Training is made data-efficient through global hard sample mining and curated DataCompDR-based datasets, and uses $L_{InfoNCE}$ for CLIP-style setups or the pairwise $L_{Sigmoid}$ for SigLIP variants. ELIP-B extends the approach to BLIP-2 by using a Q-Former and an ITM head trained with binary cross-entropy; at inference it re-ranks the top-$k$ candidates using both the initial and the ITM-based scores. Evaluations on the standard benchmarks COCO and Flickr, plus the OOD benchmarks Occluded COCO and ImageNet-R, show substantial zero-shot gains over CLIP/SigLIP/SigLIP-2 and improvements over BLIP-2 on several benchmarks, with further gains when the mapping network is fine-tuned on in-domain data. Overall, ELIP provides an efficient, scalable path to adapt large vision-language models to diverse domains for high-precision image retrieval.
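The two-stage process summarised above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `two_stage_retrieve`, the `rerank_score` callback, and the `use_initial` flag are hypothetical, and adding the two scores is only one assumed way to combine the initial and re-ranking scores, since the summary does not say how they are combined.

```python
from typing import Callable, List, Sequence


def two_stage_retrieve(
    initial_scores: Sequence[float],
    rerank_score: Callable[[int], float],
    k: int,
    use_initial: bool = False,
) -> List[int]:
    """Return gallery indices ordered best-first after re-ranking the top-k.

    initial_scores: query-independent similarity of each image to the text
        (e.g. from a frozen CLIP-style encoder pair).
    rerank_score: hypothetical callback giving a query-aware score for one
        image index (e.g. from a text-prompted image encoder or an ITM head).
    use_initial: if True, add the initial score to the re-ranking score
        (an assumed combination rule, for illustration only).
    """
    # Stage 1: rank the whole gallery with the cheap initial score.
    order = sorted(
        range(len(initial_scores)),
        key=lambda i: initial_scores[i],
        reverse=True,
    )
    head, tail = order[:k], order[k:]

    # Stage 2: re-score only the top-k candidates with the query-aware model.
    def final_score(i: int) -> float:
        score = rerank_score(i)
        return score + initial_scores[i] if use_initial else score

    head = sorted(head, key=final_score, reverse=True)

    # Candidates outside the top-k keep their stage-1 order.
    return head + tail


if __name__ == "__main__":
    initial = [0.9, 0.8, 0.7, 0.1]
    query_aware = {0: 0.2, 1: 0.5, 2: 0.55, 3: 1.0}
    print(two_stage_retrieve(initial, query_aware.__getitem__, k=3))
```

Only the top-$k$ images are passed through the more expensive query-aware model, which is what keeps re-ranking affordable on a large gallery; image 3 in the usage example is never re-scored, however high its query-aware score would be.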
Abstract
The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
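To make the text-to-prompt mapping concrete, here is a small NumPy sketch of an MLP that maps a text embedding to $m$ prompt vectors and prepends them to the image tokens. Everything in it is an assumption made for illustration: the class name `PromptMapper`, the helper `prepend_prompts`, the hidden width, the GELU activation, the random initialisation, and the placement of patch tokens after $t_{CLS}$ are not specified in the abstract.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    # Tanh approximation of GELU; the activation choice is an assumption.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class PromptMapper:
    """Hypothetical 3-layer MLP: text embedding -> m visual prompt vectors."""

    def __init__(self, d_text: int, d_hidden: int, d_img: int, m: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        dims = [d_text, d_hidden, d_hidden, m * d_img]
        # Randomly initialised weights stand in for trained parameters.
        self.weights = [
            rng.normal(0.0, 0.02, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])
        ]
        self.biases = [np.zeros(b) for b in dims[1:]]
        self.m, self.d_img = m, d_img

    def __call__(self, text_emb: np.ndarray) -> np.ndarray:
        h = text_emb
        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ w + b
            if i < last:  # no activation after the output layer
                h = gelu(h)
        # Split the flat output into m prompts of the image-token width.
        return h.reshape(self.m, self.d_img)


def prepend_prompts(
    prompts: np.ndarray, cls_token: np.ndarray, patch_tokens: np.ndarray
) -> np.ndarray:
    """Build the encoder input [t_p^1, ..., t_p^m, t_CLS, patches]."""
    return np.concatenate([prompts, cls_token[None, :], patch_tokens], axis=0)


if __name__ == "__main__":
    mapper = PromptMapper(d_text=8, d_hidden=16, d_img=12, m=4)
    prompts = mapper(np.ones(8))
    tokens = prepend_prompts(prompts, np.zeros(12), np.zeros((9, 12)))
    print(prompts.shape, tokens.shape)
```

In this sketch only the mapping network carries new parameters, which matches the abstract's description of a simple MLP added to a pre-trained model; whether the image encoder itself is frozen during training is not stated here.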
