
Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron

TL;DR

This work tackles the instability of paraphrased text-to-image retrieval in dual-encoder vision-language models by proposing a training paradigm that uses a pretrained language model as the text tower. The authors introduce a paraphrased retrieval benchmark derived from COCO captions and develop several adaptation strategies, notably freezing the language encoder and adding alignment layers, to preserve language knowledge while aligning with image representations. Across ablations and large-scale experiments on CC12M and LAION-400M, the proposed approach significantly improves rank similarity for paraphrased queries, while maintaining or exceeding zero-shot classification and retrieval performance and improving text semantic similarity. The results demonstrate a practical path toward integrating strong language models with vision encoders to enable more predictable and robust paraphrase-aware retrieval, which benefits search applications and downstream tasks.

Abstract

In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
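To make the notion of "ranking similarity for paraphrased queries" concrete, here is a minimal sketch that compares the top-k retrievals of a query and its paraphrase using Jaccard overlap. This is an illustrative proxy only: the exact metric used in the paper is not stated in this summary, and the function names (`top_k`, `top_k_jaccard`) and the toy scores are assumptions introduced for the example.

```python
def top_k(scores, k):
    """Return the indices of the k highest-scoring gallery images."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_jaccard(scores_a, scores_b, k):
    """Jaccard overlap between the top-k sets retrieved by two queries.

    1.0 means both queries retrieve the same k images; 0.0 means no overlap.
    Note: this ignores the order within the top-k (an assumption of this sketch).
    """
    a = set(top_k(scores_a, k))
    b = set(top_k(scores_b, k))
    return len(a & b) / len(a | b)


# Hypothetical query-to-image similarity scores over a gallery of 4 images.
scores_original = [0.9, 0.8, 0.1, 0.2]     # e.g. a query containing "kid"
scores_paraphrase = [0.85, 0.1, 0.7, 0.2]  # e.g. the same query with "child"

overlap = top_k_jaccard(scores_original, scores_paraphrase, k=2)
```

A model that is robust to paraphrasing should push this overlap toward 1.0 when averaged over many query pairs.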

Paper Structure (17 sections, 5 equations, 7 figures, 11 tables)

This paper contains 17 sections, 5 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Paraphrased retrieval. Left: The CLIP model (Radford et al., 2021) often returns very different top retrievals for two text queries that are paraphrases of each other. In this example, the paraphrased query differs by a single word: "kid" $\rightarrow$ "child". Right: Our approach returns similar top retrievals for paraphrased queries by adapting a frozen pretrained language model during dual-encoder training. We outline images in green that appear in the top retrievals for both the original and paraphrased queries.
  • Figure 2: Paraphrased samples by GPT-3. We show five sample images, their original queries, and the corresponding paraphrases generated by GPT-3. From top to bottom, we show the human-provided semantic similarity scores. A semantic similarity score of 5 indicates that the paraphrase has an identical meaning to the original query.
  • Figure 3: Adapting pre-trained language models for paraphrased retrieval. We explore different strategies to adapt pre-trained language models for paraphrased retrieval. (a) illustrates the CLIP baseline, which trains a dual encoder that optimizes the InfoNCE loss ($\mathcal{L}$). For CLIP, both the visual encoder ($\mathcal{V}$) and text encoder ($\mathcal{T}$) are trained from random initialization. (b-e) illustrate the adaptation strategies that we study, all of which leverage a pre-trained language encoder $\mathcal{T}^*$ while training the visual encoder from scratch. This set of adaptation approaches optimizes a loss $\mathcal{L}_{\mathcal{T}^*}$. (b) finetunes the text encoder weights, (c) freezes the text encoder weights, (d) inserts bottleneck adapter layers $\mathcal{A}$ (Bapna and Firat, 2019; Houlsby et al., 2019) into the text encoder, and (e) appends alignment layers $\mathcal{A}$ atop the text encoder.
  • Figure 4: Qualitative results for CLIP (Radford et al., 2021) (left) and our approach (right). We outline images in green that appear in the top-10 retrievals for both the query and paraphrased query. The full top-10 retrievals can be found in the supplementary material (Figures B.1 & B.2). Please see the text for a discussion of the results.
  • Figure A.1: Paraphrased text-image retrieval dataset samples.
  • ...and 2 more figures
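
The Figure 3 caption describes a dual encoder trained with the InfoNCE loss, with strategy (e) appending alignment layers atop a frozen text encoder. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. It assumes the symmetric (image-to-text plus text-to-image) form of InfoNCE commonly used for CLIP-style training, a single bias-free linear map as the alignment layer, and an illustrative temperature of 0.07. All names (`info_nce`, `linear`, `align_weight`, `frozen_text_feats`) and all numbers are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def linear(weight, x):
    """Apply a bias-free linear layer; `weight` has one row per output dim."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def log_softmax_at(logits, idx):
    """Numerically stable log-softmax of `logits`, evaluated at position `idx`."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[idx] - log_sum_exp


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where pair i is (image_embs[i], text_embs[i])."""
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    n = len(img)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text: each image should pick its own text among all texts.
    i2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image: each text should pick its own image among all images.
    t2i = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
               for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Outputs of a frozen pretrained text encoder (never updated in strategy (e)).
frozen_text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Trainable alignment layer projecting text features into the image space.
align_weight = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
text_embs = [linear(align_weight, f) for f in frozen_text_feats]
# Outputs of the visual encoder (trained from scratch in the paper's setup).
image_embs = [[0.9, 0.1], [0.2, 0.8]]

loss_matched = info_nce(image_embs, text_embs)        # correct pairing
loss_swapped = info_nce(image_embs, text_embs[::-1])  # deliberately mispaired
```

In a real training loop only `align_weight` and the visual encoder would receive gradients, which is what lets the frozen text encoder keep the language knowledge it acquired during text-only pretraining. The loss is low when matched pairs are the most similar in the batch and high when the pairing is broken.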