Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification

Jintao Rong; Hao Chen; Linlin Ou; Tianxiao Chen; Xinyi Yu; Yifan Liu

TL;DR

RePrompt (retrieval-enhanced visual prompt learning) introduces retrieval mechanisms that cache and reuse knowledge from downstream tasks, improving generalization on those tasks.

Abstract

The Contrastive Language-Image Pretraining (CLIP) model has been widely used in various downstream vision tasks, and the few-shot learning paradigm has been widely adopted to augment its capacity for these tasks. However, current paradigms may struggle with fine-grained classification, such as satellite image recognition, due to widening domain gaps. To address this limitation, we propose retrieval-enhanced visual prompt learning (RePrompt), which introduces retrieval mechanisms to cache and reuse the knowledge of downstream tasks. RePrompt constructs a retrieval database from either training examples or external data if available, and uses a retrieval mechanism to enhance multiple stages of a simple prompt learning baseline, thus narrowing the domain gap. During inference, our enhanced model can reference similar samples brought by retrieval to make more accurate predictions. A detailed analysis reveals that retrieval helps to improve the distribution of late features, thereby improving generalization for downstream tasks. RePrompt attains state-of-the-art performance on a wide range of vision datasets, including 11 image datasets, 3 video datasets, 1 multi-view dataset, and 4 domain generalization benchmarks.

Paper Structure (16 sections, 9 equations, 7 figures, 10 tables)

Introduction
Related Work
Preliminaries
Proposed Method
Retrieval Module
Retrieval-enhanced Visual Prompting
Retrieval-based Adapter
Retrieval-guided Training
Experiments
Few-shot Classification
Domain Generalization
Additional Few-shot Classification
Ablation Study
Retrieval Discussions
Analysis of Retrieval
...and 1 more sections

Figures (7)

Figure 1: The overall architecture of vision-language prompt tuning. Visual and textual prompt tokens, which are the only learnable parameters in this setup, are incorporated into the vision and language branches of CLIP, respectively. These prompts are designed to optimize performance in low-shot scenarios while preserving the model's original generalization capabilities.
Figure 2: The overall workflow of RePrompt includes four main steps: (a) Encoding an image input into a query embedding using a frozen image encoder; (b) Encoding each image entry from the training dataset into key and value embedding pairs with the same frozen image encoder, where the value embeddings also include one-hot representations of labels. We retrieve the top-K relevant knowledge items through maximum inner product search and integrate this knowledge to generate visual prompts; (c) Retrieval-enhanced visual prompts are introduced into the $J$ layers of the visual branch inputs, while the other prompts remain consistent with those of the baseline VLPT; (d) The final output is derived from a linear combination of the prompt-tuned CLIP prediction and the retrieval-aided prediction. The retrieval database for each dataset comes from a corresponding few-shot training set for fairness.
Figure 3: Overview of the visual prompt learner and REConv. The visual prompt learner uses REConv to generate dynamic visual prompts by learning from the retrieved results.
Figure 4: Main results over 11 datasets under the few-shot settings. We report the average accuracy ($\%$) of three runs for 1, 2, 4, 8, and 16 shots. The proposed RePrompt achieves significant performance improvements on most downstream recognition datasets.
Figure 5: Visualization of attention response map between retrieval-enhanced visual prompts and image patch tokens. The mean self-attention map is from the last vision transformer layers.
...and 2 more figures
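
The retrieval steps in the Figure 2 caption (top-K lookup by maximum inner product search over key embeddings, one-hot labels stored as part of the values, and a final linear combination of two predictions) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the softmax weighting of retrieved labels, the `temperature` and `alpha` parameters, and the toy 2-D embeddings are all assumptions made for illustration, and the brute-force search stands in for whatever index the paper uses.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row (or a single vector) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def retrieve_top_k(query, keys, k):
    """Brute-force maximum inner product search.

    query: (D,) embedding of the input image.
    keys:  (N, D) embeddings of the database images.
    Returns the indices and scores of the k highest inner products.
    """
    scores = keys @ query
    top_idx = np.argsort(-scores)[:k]
    return top_idx, scores[top_idx]


def retrieval_prediction(scores, labels_one_hot, temperature=1.0):
    """Similarity-weighted vote over retrieved one-hot labels.

    The softmax weighting is an assumption of this sketch.
    """
    weights = np.exp((scores - scores.max()) / temperature)
    weights = weights / weights.sum()
    return weights @ labels_one_hot


def combine(clip_probs, retrieval_probs, alpha=0.5):
    """Linear combination of the two predictions (alpha is hypothetical)."""
    return (1.0 - alpha) * clip_probs + alpha * retrieval_probs


# Toy database: four 2-D "embeddings" with labels from two classes.
keys = l2_normalize(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]))
labels = np.array([0, 0, 1, 1])
values_one_hot = np.eye(2)[labels]

query = l2_normalize(np.array([1.0, 0.05]))
top_idx, top_scores = retrieve_top_k(query, keys, k=2)

retrieval_probs = retrieval_prediction(top_scores, values_one_hot[top_idx])
clip_probs = np.array([0.4, 0.6])  # placeholder for the prompt-tuned CLIP output
final_probs = combine(clip_probs, retrieval_probs, alpha=0.5)
print(top_idx, retrieval_probs, final_probs)
```

In this toy case both retrieved neighbours belong to class 0, so the retrieval-aided prediction overrides a placeholder CLIP prediction that leans slightly towards class 1. In practice the database would hold features from the frozen image encoder rather than hand-written vectors.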
