Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models

Yifei Ming, Yixuan Li

TL;DR

This work conducts a systematic study of retrieval-augmented task adaptation for vision-language models, focusing on how retrieval modality and logit ensemble influence adaptation to fine-grained downstream data. It compares image-to-image (I2I) and text-to-image (T2I) retrieval, showing that I2I consistently outperforms T2I and can closely approach oracle performance when retrieved data aligns with the target distribution. A central finding is that ensembling the zero-shot CLIP logits with retrieved-sample logits is essential for achieving substantial gains, a claim strengthened by theoretical analyses that bound risks and explain the modality gap and retrieval-induced shifts. The study provides extensive ablations across backbones, seeds, and data mixtures, offering practical design guidelines for effective retrieval-augmented adaptation in low-data regimes and establishing a theoretical foundation for why certain retrieval strategies work better than others.

Abstract

Pre-trained contrastive vision-language models have demonstrated remarkable performance across a wide range of tasks. However, they often struggle on fine-grained datasets with categories not adequately represented during pre-training, which makes adaptation necessary. Recent works have shown promising results by utilizing samples from web-scale databases for retrieval-augmented adaptation, especially in low-data regimes. Despite the empirical success, understanding how retrieval impacts the adaptation of vision-language models remains an open research question. In this work, we adopt a reflective perspective by presenting a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations.

Paper Structure (44 sections, 10 theorems, 37 equations, 16 figures, 6 tables)

Key Result

Theorem 4.1

With probability at least $1-\delta$, the following upper bound of the ensemble risk holds: where $L \le \sqrt{\exp(2) + 1}$, $\kappa$ characterizes the inner-class feature concentration (Definition 4.1), and $\xi$ is either $\xi_{\mathbf{s}}$ for I2I retrieval or $\xi_{\mathbf{t}}$ for T2I retrieval.

Figures (16)

  • Figure 1: Illustration of the retrieval-augmented task adaptation framework for CLIP-like models. (a) Given a downstream target dataset, we first retrieve relevant samples from a web-scale database using seed prompts (T2I) or seed images (I2I). We can then build a K-shot cache by selecting the Top-K most similar images per class based on CLIP embeddings. (b) At inference time, the final logit $f^{\text{EN}}$ of a test input is an ensemble (weighted sum) of logits from the zero-shot model $f^{\text{ZOC}}$ and the few-shot cache $f^{\text{RET}}$.
  • Figure 2: Comparison of adaptation performance (in accuracy) of different retrieval methods. Compared to the zero-shot model (purple star), I2I retrieval significantly improves the performance and consistently outperforms T2I retrieval across shots and datasets.
  • Figure 3: Samples from T2I and I2I retrieval. Top row: the main source of noise for T2I retrieval is semantic ambiguity, as the textual queries (e.g., a photo of a cellphone) may not accurately describe the images from target distributions (e.g., cellphones typical in the early 2000s). Middle row: samples retrieved by I2I match the in-distribution (ID) data more closely. Bottom row: images sampled from the target (ID) distribution. More examples are given in the appendix.
  • Figure 4: Importance of ensemble for I2I retrieval. Ensemble corresponds to the default logit ensemble: $f^{\text{EN}} = \alpha f^{\text{ZOC}} + \gamma f^{\text{RET}}$ with $\alpha,\gamma \in(0,1)$. RET denotes only using $f^{\text{RET}}$ ($\alpha=0,\gamma=1$) and ZOCLIP denotes only using $f^{\text{ZOC}}$ ($\alpha=1,\gamma=0$). By ensembling the prediction with retrieved samples ($K=16$), the performance improvement over zero-shot prediction is significant for most datasets.
  • Figure 5: Impact of architecture. We report the average performance (over all datasets) for I2I retrieval and T2I retrieval under different CLIP backbones and observe consistent trends. Results for individual datasets are given in the appendix.
  • ...and 11 more figures
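
The two steps described in the captions of Figures 1 and 4 (building a K-shot cache from retrieved samples, then taking a weighted sum of zero-shot and cache logits) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_cache` and `ensemble_logits`, the sharpness parameter `beta`, and the specific form of the cache logits (exponentiated cosine affinities multiplied by one-hot labels) are hypothetical choices made for the example. All features are assumed to be L2-normalized CLIP embeddings.

```python
import numpy as np


def build_cache(retrieved_feats, retrieved_labels, class_queries, k):
    """Keep the top-k retrieved features per class.

    Samples are ranked by cosine similarity to that class's query embedding
    (a seed-prompt embedding for T2I, a seed-image embedding for I2I).
    All inputs are assumed to be L2-normalized.
    """
    keys, labels = [], []
    for c, query in enumerate(class_queries):
        idx = np.flatnonzero(retrieved_labels == c)   # candidates for class c
        sims = retrieved_feats[idx] @ query           # cosine similarities
        top = idx[np.argsort(-sims)[:k]]              # k most similar
        keys.append(retrieved_feats[top])
        labels.append(np.full(len(top), c))
    return np.concatenate(keys), np.concatenate(labels)


def ensemble_logits(x, text_weights, cache_keys, cache_labels,
                    alpha=0.5, gamma=0.5, beta=5.0):
    """Return f_EN = alpha * f_ZOC + gamma * f_RET for test features x.

    f_ZOC: similarity between x and the class text embeddings.
    f_RET: assumed form, affinity to cached features aggregated per class.
    """
    num_classes = text_weights.shape[0]
    f_zoc = x @ text_weights.T                             # (n, C)
    affinity = np.exp(-beta * (1.0 - x @ cache_keys.T))    # (n, M)
    one_hot = np.eye(num_classes)[cache_labels]            # (M, C)
    f_ret = affinity @ one_hot                             # (n, C)
    return alpha * f_zoc + gamma * f_ret
```

Setting `alpha=1, gamma=0` recovers the zero-shot-only prediction (ZOCLIP in Figure 4), and `alpha=0, gamma=1` recovers the retrieval-only prediction (RET).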

Theorems & Definitions (22)

  • Theorem 4.1: Benefit of uni-modal retrieval
  • Theorem 4.2: Benefit of logit ensemble
  • Definition 4.1: Inner-class concentration and inter-class separation
  • Definition 4.2
  • Definition 4.3: Modality gap
  • Definition 4.4: Retrieval distribution shift
  • Definition 4.5: Knowledge encoded in different modalities
  • Lemma 4.8
  • proof
  • Lemma 4.9
  • ...and 12 more