
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman

TL;DR

ELIP introduces a lightweight text-guided visual prompting mechanism that conditions the ViT image encoder with a set of prompt vectors generated by a 3-layer MLP mapping network from the text query, enabling query-aware re-ranking in text-to-image retrieval. It treats retrieval as a two-stage process and demonstrates data-efficient training via global hard sample mining and curated DataCompDR-based datasets, with prompts integrated into the input as $[t_p^1, \dots, t_p^m, t_{CLS}]$. Training employs $L_{InfoNCE}$ for CLIP-style setups or pairwise $L_{Sigmoid}$ for SigLIP variants, and ELIP-B extends to BLIP-2 by using a Q-Former and ITM head with binary cross-entropy; inference re-ranks top-$k$ candidates using both initial and ITM-based scores. Evaluations on standard benchmarks COCO and Flickr, plus OOD benchmarks Occluded COCO and ImageNet-R, show substantial zero-shot gains over CLIP/SigLIP/SigLIP-2 and BLIP-2, with further improvements when the mapping network is fine-tuned on in-domain data. Overall, ELIP provides an efficient, scalable path to adapt large vision-language models to diverse domains for high-precision image retrieval.
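The two-stage process described above can be illustrated with a minimal sketch. It assumes the initial scores come from a query-independent first pass and that a query-aware model re-scores only the top-$k$ candidates. The function name `rerank_top_k`, the image ids, and the scores are hypothetical. The summary states that ELIP-B combines initial and ITM-based scores, but the combination rule is not given here, so this sketch orders the top-$k$ by the re-ranking score alone.

```python
from typing import Callable, Dict, List


def rerank_top_k(
    initial_scores: Dict[str, float],
    rerank_score: Callable[[str], float],
    k: int,
) -> List[str]:
    """Rank the whole gallery by initial score, then re-order only the top-k."""
    # Stage 1: cheap ranking of every image using the initial scores.
    ranked = sorted(initial_scores, key=initial_scores.get, reverse=True)
    head, tail = ranked[:k], ranked[k:]
    # Stage 2: the (assumed more expensive) query-aware scorer sees only the top-k.
    head = sorted(head, key=rerank_score, reverse=True)
    # Images outside the top-k keep their initial order.
    return head + tail


# Hypothetical scores for four gallery images and one text query.
initial = {"img_a": 0.9, "img_b": 0.8, "img_c": 0.7, "img_d": 0.1}
query_aware = {"img_a": 0.2, "img_b": 0.4, "img_c": 0.95, "img_d": 0.99}

result = rerank_top_k(initial, query_aware.get, k=3)
print(result)  # ['img_c', 'img_b', 'img_a', 'img_d']
```

Note that `img_d` stays last despite its high query-aware score, because it never entered the top-$k$: the re-ranker can only fix ordering errors among candidates the first stage already retrieved.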

Abstract

The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
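As a rough illustration of the text-to-prompt idea, the sketch below maps a text feature through a 3-layer MLP to $m$ prompt vectors and prepends them to the ViT token sequence. It is a toy under stated assumptions, not the authors' implementation: the ReLU activations, hidden width, random initialisation, the placement of patch tokens after the [CLS] token, and all names (`MappingNetwork`, `build_vit_input`) are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Dense layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class MappingNetwork:
    """3-layer MLP from a text feature to num_prompts visual prompt vectors."""

    def __init__(self, text_dim: int, hidden_dim: int, visual_dim: int,
                 num_prompts: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Three linear layers; the last one emits all prompts at once.
        dims = [text_dim, hidden_dim, hidden_dim, num_prompts * visual_dim]
        self.layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            weight = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
            self.layers.append((weight, [0.0] * d_out))
        self.visual_dim = visual_dim
        self.num_prompts = num_prompts

    def __call__(self, text_feature: Vector) -> List[Vector]:
        h = text_feature
        for i, (weight, bias) in enumerate(self.layers):
            h = linear(h, weight, bias)
            if i < len(self.layers) - 1:  # assumed: no activation after the last layer
                h = relu(h)
        d = self.visual_dim
        # Split the flat output into num_prompts vectors of size visual_dim.
        return [h[i * d:(i + 1) * d] for i in range(self.num_prompts)]


def build_vit_input(prompts: List[Vector], cls_token: Vector,
                    patch_tokens: List[Vector]) -> List[Vector]:
    """Prepend the text-derived prompts to the [CLS] token and the patch tokens."""
    return prompts + [cls_token] + patch_tokens


mapper = MappingNetwork(text_dim=8, hidden_dim=16, visual_dim=4, num_prompts=3)
prompts = mapper([0.5] * 8)
tokens = build_vit_input(prompts, cls_token=[0.0] * 4, patch_tokens=[[1.0] * 4] * 5)
print(len(tokens))  # 9 tokens: 3 prompts + 1 [CLS] + 5 patches
```

Because only the mapping network produces new tokens, the image encoder's own weights can stay untouched in a setup like this, which is consistent with the lightweight design the abstract describes.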

Paper Structure

This paper contains 34 sections, 1 equation, 26 figures, 8 tables.

Figures (26)

  • Figure 1: The ELIP architecture. Left: We propose a novel architecture that can be applied to pre-trained and frozen vision-language foundation models, such as CLIP, SigLIP, SigLIP-2 and BLIP-2, to enhance their text-to-image retrieval performance. The key idea is to use the text query to define a set of visual prompt vectors that are incorporated into the image encoder to make it aware of the query when generating the embedding. An MLP maps from the text space to the visual space of the input to the ViT encoder. The architecture is lightweight, and our data curation strategies enable efficient and effective training with limited resources. Right: In this retrieval example from the COCO benchmark, the top-$k$ ($k=100$) images are re-ranked by our ELIP model for the text query: 'People on bicycles ride down a busy street'. The ground truth image matching the query is not among the top-5 images in the initial CLIP ranking, but is ranked top-1 (highlighted in the dashed box) after re-ranking.
  • Figure 2: Architecture of ELIP-C / ELIP-S. At training time, a batch of text-image pairs is fed into the architecture. The text feature is mapped to the visual embedding space as a set of prompt vectors via the MLP mapping network and then guides the encoding of the image feature. We use color coding for the [CLS] token, patch tokens, and generated visual tokens from text. The architecture is trained with InfoNCE loss (for ELIP-C) and Sigmoid loss (for ELIP-S/ELIP-S-2), to align the text feature with the corresponding re-computed image feature.
  • Figure 3: Architecture of ELIP-B. As in the CLIP/SigLIP architecture, the MLP Mapping Network maps the text feature to the visual embedding space. The only difference is that the text-guided image features are further fed into the Q-Former to cross-attend to the input text and then passed through the Image-Text Matching (ITM) Head to predict whether the image and text match. As the input image features to the ITM head have changed, we also fine-tune the ITM head, which is a lightweight MLP network. The network is fed pairs of text and positive/negative image features at training time and is trained with binary cross-entropy loss.
  • Figure 4: Examples of generated training batches via global hard sample mining. For each row, the first sample is used to group other samples. Captions for row 1 (from left to right): 'a wooden table with no base'; 'a wooden table with a couple of folding legs on it'; 'a table that has a metal base with an olive wood top'; 'small table outdoors sitting on top of the asphalt'. Captions for row 2 (from left to right): 'a huge body of blue ice floats in a mountain stream'; 'the big chunk of glacier is falling off of the cliff'; 'there is a broken piece of glass that has been broken from the ground'; 'a body of water surrounded by a forest near a mountain'. It can be observed that the images and captions within each batch are very similar to each other, and significantly closer than the images and captions in a random batch.
  • Figure 5: Examples of the out-of-distribution benchmarks. Occluded COCO is on the left, and ImageNet-R is on the right. For both benchmarks, the positive images contain the object described by the text query while the negative images do not. We display positive images in the first row and negative images in the second row. For Occluded COCO, the target object in the image is occluded, making it harder to retrieve. For example, for the text query Bicycle in Occluded COCO, positive images contain an occluded bicycle (highlighted in the dashed box) while negative images contain no bicycle; for the text query Goldfish in ImageNet-R, positive images contain goldfish while negative images do not.
  • ...and 21 more figures
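
The batch grouping described in the Figure 4 caption, where the first sample of each row is used to group the others, can be sketched as a nearest-neighbour selection. This is only an illustration under assumptions: the caption does not say which features or similarity measure are used, so the cosine similarity, the toy 2-D features, and the function name `mine_hard_batch` are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_batch(features: List[Sequence[float]], seed_index: int,
                    batch_size: int) -> List[int]:
    """Build one batch: the seed sample plus its most similar samples."""
    seed = features[seed_index]
    others = [i for i in range(len(features)) if i != seed_index]
    # Most similar samples first: these act as hard negatives for the seed.
    others.sort(key=lambda i: cosine(seed, features[i]), reverse=True)
    return [seed_index] + others[:batch_size - 1]


# Toy 2-D features standing in for image or caption embeddings (assumed).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-1.0, 0.0]]
batch = mine_hard_batch(features, seed_index=0, batch_size=3)
print(batch)  # [0, 1, 3]
```

Grouping similar samples into one batch means every negative in the batch is close to the positive, so a contrastive loss gets a more informative signal per step than it would from a random batch.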