Retrieval-Enhanced Contrastive Vision-Text Models

Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

TL;DR

This work proposes to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions.

Abstract

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes.
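The retrieve-then-fuse step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a memory of paired, L2-normalized image and text embeddings, and it uses a single-head attention layer with random weights as a stand-in for the learned single-layer fusion transformer. All names (`retrieve_paired_text`, `fuse`, `w_q`, `w_k`, `w_v`) and all sizes are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve_paired_text(image_query, memory_images, memory_texts, k):
    """Search the memory with an image query, return the paired text embeddings.

    image_query: (d,), memory_images: (n, d), memory_texts: (n, d).
    Row i of memory_texts is the caption embedding paired with memory image i.
    """
    sims = memory_images @ image_query       # cosine similarity for unit vectors
    top_k = np.argsort(-sims)[:k]            # indices of the k most similar images
    return top_k, memory_texts[top_k]

def fuse(query, retrieved, w_q, w_k, w_v):
    """Assumed stand-in for the learned fusion: one attention head plus a residual."""
    tokens = np.vstack([query[None, :], retrieved])   # (k + 1, d)
    q = query @ w_q
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())           # numerically stable softmax
    weights /= weights.sum()
    return l2_normalize(query + weights @ values)     # refined embedding, (d,)

# Toy memory with random embeddings (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
n, d, k = 100, 16, 4
memory_images = l2_normalize(rng.normal(size=(n, d)))
memory_texts = l2_normalize(rng.normal(size=(n, d)))
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

image_query = memory_images[7]   # a query identical to memory item 7
indices, retrieved = retrieve_paired_text(image_query, memory_images, memory_texts, k)
refined = fuse(image_query, retrieved, w_q, w_k, w_v)
print(indices, refined.shape)
```

In this sketch the image and text encoders are treated as frozen, so only the fusion weights would be trained. Swapping or extending `memory_images` and `memory_texts` changes what can be retrieved without touching those weights.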

Paper Structure (23 sections, 2 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: RECO works by complementing the frozen representations of pre-trained image-text encoders (such as CLIP) with knowledge retrieved from an external memory. We use an image representation as a query to identify the $k$ most similar images and integrate their associated text embeddings to create a multi-modal representation. Likewise, given a text representation as a query, we find the top-$k$ most similar texts and incorporate their associated images. The fusion of original and retrieved embeddings is done by learning a shallow fusion model to produce improved, multi-modal and knowledge-enhanced versions of the original embeddings. We train for alignment between the refined embeddings, as well as between the refined and original embeddings.
  • Figure 2: Conceptual comparison of uni-/cross-modal search and uni-/cross-modal fusion. We illustrate the different scenarios for an input image $I$, while the scenarios for text input $T$ are shown in the Appendix.
  • Figure 3: (left) Disentangling the effect of additional training and RECO. (middle) Effect of updating the memory after training. (right) Effect of the number $k$ of retrieved elements. We report zero-shot top-1 accuracy on CUB. The CLIP baseline is shown for reference.
  • Figure 4: Qualitative examples on CUB and Cars datasets. We compare uni- versus cross-modal search for two image queries (top) and two text queries (bottom). Uni-modal search finds more suitable matches to the query, which improves the relevance of the fused elements. We frame in red (resp. green) the irrelevant (resp. relevant) retrieved items to be fused with the query.
  • Figure 5: Qualitative examples of RECO for image and text retrieval. We display image and text queries on the left panel and retrieved captions and images on the right panel. Retrieved images tend to match the original input image better than retrieved captions match the original input text. For example, the captions retrieved for the aerial view rarely mention "mountains", even though the word is present in the original text. Instead, they mention many specific locations, for example Lima, Cuzco, Arizona or Afghanistan, which are not relevant to the original text description. In contrast, the images retrieved from the text query are semantically similar to the original image. This qualitatively explains why the best performance of RECO for zero-shot retrieval is achieved by disabling retrieval on the query image and enabling it on the query text (see Table 4 of the main paper).
  • ...and 4 more figures
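
The Figure 1 caption states that training aligns the refined embeddings with each other and with the original embeddings. One plausible reading of that objective is sketched below as a sum of symmetric InfoNCE terms. The function names, the temperature value, and the equal weighting of the three terms are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss; row i of `a` is the positive for row i of `b`.

    a, b: (batch, d) arrays of L2-normalized embeddings.
    """
    logits = a @ b.T / temperature
    idx = np.arange(a.shape[0])
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()   # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()   # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def alignment_loss(img, txt, img_refined, txt_refined, temperature=0.07):
    """Assumed objective: refined-refined plus the two refined-original terms."""
    return (info_nce(img_refined, txt_refined, temperature)
            + info_nce(img_refined, txt, temperature)
            + info_nce(img, txt_refined, temperature))

# Toy check with random unit vectors (hypothetical sizes).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

matched = info_nce(emb, emb)           # every row paired with itself
mismatched = info_nce(emb, emb[::-1])  # every row paired with a different row
print(matched < mismatched)
```

With correctly paired rows the loss is close to zero, and it grows once the pairing is broken, which is the behavior a contrastive alignment term is expected to show.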