Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Tian Liu, Huixin Zhang, Shubham Parashar, Shu Kong

TL;DR

This work tackles few-shot recognition (FSR) by leveraging Vision-Language Models (VLMs) and retrieval-augmented learning (RAL). It first shows that finetuning the VLM on few-shot data alone already yields strong gains, then reveals that naively finetuning on retrieved data is hampered by domain gaps and class imbalance. The authors propose Stage-Wise retrieval-Augmented fineTuning (SWAT), a two-stage strategy that first finetunes end-to-end on a mix of retrieved and few-shot data and then retrains the classifier on the few-shot data alone, with CutMix augmentation further improving robustness. Across nine benchmarks, SWAT outperforms prior methods by more than 6% accuracy, which makes it promising for real-world data-annotation workflows.
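
The summary names CutMix without describing it. As a rough illustration only, the sketch below implements the standard CutMix recipe (paste a random box from one image into another and mix the labels by area) on nested lists. The function name `cutmix`, the `alpha` parameter, and the toy images are assumptions made for this example; the configuration actually used with SWAT is not stated here.

```python
import random


def cutmix(img_a, label_a, img_b, label_b, num_classes, rng, alpha=1.0):
    """Paste a random box from img_b into a copy of img_a; mix labels by area.

    Images are lists of rows (H x W) with identical shapes. This follows the
    generic CutMix recipe and is not taken from the SWAT implementation.
    """
    h, w = len(img_a), len(img_a[0])
    lam = rng.betavariate(alpha, alpha)      # target fraction kept from img_a
    cut_h = int(h * (1.0 - lam) ** 0.5)      # box sides scale with sqrt(1 - lam)
    cut_w = int(w * (1.0 - lam) ** 0.5)
    cy, cx = rng.randrange(h), rng.randrange(w)  # box centre
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = [row[:] for row in img_a]
    for y in range(top, bottom):
        mixed[y][left:right] = img_b[y][left:right]

    # Recompute lam from the area actually pasted (the box may be clipped).
    lam = 1.0 - (bottom - top) * (right - left) / (h * w)
    target = [0.0] * num_classes
    target[label_a] += lam
    target[label_b] += 1.0 - lam
    return mixed, target


rng = random.Random(0)
zeros = [[0] * 8 for _ in range(8)]
ones = [[1] * 8 for _ in range(8)]
mixed, target = cutmix(zeros, 0, ones, 1, num_classes=3, rng=rng)
```

With an all-zero and an all-one image, the fraction of ones in `mixed` equals the weight assigned to the second label, which is a quick sanity check on the label mixing.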

Abstract

Few-shot recognition (FSR) aims to train a classification model with only a few labeled examples of each concept concerned by a downstream task, where data annotation cost can be prohibitively high. We develop methods to solve FSR by leveraging a pretrained Vision-Language Model (VLM). We particularly explore retrieval-augmented learning (RAL), which retrieves open data, e.g., the VLM's pretraining dataset, to learn models for better serving downstream tasks. RAL has been studied in zero-shot recognition but remains under-explored in FSR. Although applying RAL to FSR may seem straightforward, we observe interesting and novel challenges and opportunities. First, somewhat surprisingly, finetuning a VLM on a large amount of retrieved data underperforms state-of-the-art zero-shot methods. This is due to the imbalanced distribution of retrieved data and its domain gaps with the few-shot examples in the downstream task. Second, more surprisingly, we find that simply finetuning a VLM solely on few-shot examples significantly outperforms previous FSR methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issues, we propose Stage-Wise retrieval-Augmented fineTuning (SWAT), which involves end-to-end finetuning on mixed data in the first stage and retraining the classifier on the few-shot data in the second stage. Extensive experiments on nine popular benchmarks demonstrate that SWAT significantly outperforms previous methods by >6% accuracy.
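
The two-stage schedule described in the abstract can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: `ToyModel` (a per-dimension scaling "encoder" plus a linear classifier), `sgd_epoch`, `swat`, and all hyperparameters are hypothetical stand-ins for a VLM visual encoder and its training loop. Whether the classifier is re-initialized before stage 2 is not stated in the abstract, so the sketch simply continues from the stage-1 weights.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


class ToyModel:
    """Hypothetical stand-in: 'encoder' = per-dimension scale, plus a linear classifier."""

    def __init__(self, dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.enc = [1.0] * dim
        self.cls = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(num_classes)]

    def features(self, x):
        return [w * v for w, v in zip(self.enc, x)]

    def probs(self, x):
        f = self.features(x)
        return softmax([sum(w * v for w, v in zip(row, f)) for row in self.cls])


def sgd_epoch(model, data, lr, train_encoder):
    """One pass of per-example SGD on cross-entropy loss."""
    for x, y in data:
        f = model.features(x)
        p = model.probs(x)
        g = [p[k] - (1.0 if k == y else 0.0) for k in range(len(p))]  # dL/dlogits
        if train_encoder:
            # dL/dfeatures, computed before the classifier weights change.
            gf = [sum(g[k] * model.cls[k][j] for k in range(len(g)))
                  for j in range(len(f))]
        for k, row in enumerate(model.cls):
            for j in range(len(row)):
                row[j] -= lr * g[k] * f[j]
        if train_encoder:
            for j in range(len(model.enc)):
                model.enc[j] -= lr * gf[j] * x[j]


def swat(model, retrieved, few_shot, epochs_stage1, epochs_stage2, lr):
    # Stage 1: end-to-end finetuning on the mix of retrieved and few-shot data.
    mixed = retrieved + few_shot
    for _ in range(epochs_stage1):
        sgd_epoch(model, mixed, lr, train_encoder=True)
    # Stage 2: retrain only the classifier on few-shot data (encoder frozen).
    for _ in range(epochs_stage2):
        sgd_epoch(model, few_shot, lr, train_encoder=False)
    return model


# Toy data: the retrieved set is imbalanced (5 vs. 1), the few-shot set is balanced.
retrieved = [([1.0, 0.0], 0)] * 5 + [([0.0, 1.0], 1)]
few_shot = [([0.9, 0.1], 0), ([0.1, 0.9], 1)]
model = swat(ToyModel(dim=2, num_classes=2), retrieved, few_shot, 3, 3, lr=0.1)
```

The point of the second stage in this sketch is that the classifier sees only the balanced, in-domain few-shot examples, while the encoder keeps whatever it learned from the larger mixed set.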

Paper Structure (20 sections, 13 figures, 18 tables)

Figures (13)

  • Figure 1: A summary of few-shot recognition (FSR) benchmarking results over nine datasets. Somewhat surprisingly, although underexplored in the literature, finetuning the entire visual encoder solely on few-shot annotated data (green line) already outperforms previous methods \cite{lin2023multimodality, clap24} by >3% accuracy. Yet finetuning only on retrieved data via retrieval-augmented learning (RAL, orange line) underperforms state-of-the-art zero-shot methods \cite{parashar2024neglected}. This is because the retrieved data follows an imbalanced distribution and has domain gaps with the few-shot data (Fig. \ref{fig:domain_gap}). By addressing these issues, our SWAT performs the best (red line), achieving >6% higher accuracy than previous methods. Refer to Appendix Fig. \ref{fig:compare_sota} for detailed results on each of the nine datasets.
  • Figure 2: Overview of our Stage-Wise retrieval-Augmented fineTuning (SWAT) for few-shot recognition (FSR). Consider the scenario where one wants to train a model on a few examples per concept defined in data annotation guidelines. SWAT exploits a pretrained Vision-Language Model (VLM) and retrieves open data, e.g., the VLM's pretraining data relevant to the concepts of interest. We observe that the retrieved data follows an imbalanced distribution and has domain gaps with the few-shot examples (Fig. \ref{fig:domain_gap}). SWAT addresses the two issues jointly by first finetuning the VLM's visual encoder end-to-end on mixed retrieved and few-shot annotated data, then retraining the classifier using only the few-shot examples. Over nine FSR benchmarks, our SWAT achieves state-of-the-art performance, significantly outperforming previous methods by >6% accuracy (Fig. \ref{fig:ft_retrieve}).
  • Figure 3: Retrieved data shows domain gaps with downstream few-shot data and follows an imbalanced distribution. Left: we compare retrieved and few-shot annotated images for random categories from five benchmark datasets. The two sets of images exhibit clear domain gaps in image style, background content, and even semantics, e.g., the animal with banded stripes in the DTD dataset. Right: retrieved data follows imbalanced distributions w.r.t. the concepts defined in different downstream tasks, as the VLM's pretraining set does not contain sufficient examples for certain classes. Because of these two issues, leveraging the retrieved data to improve FSR presents significant challenges. Refer to Table \ref{tab:dataset_acc} for quantitative justification of the domain gaps, and to Appendix Figs. \ref{fig:imbalanced_all} and \ref{fig:retrived_imgs_more} for additional examples from more datasets.
  • Figure 4: Retraining the classifier on the few-shot data does not suffer from overfitting. We show the testing accuracy at different epochs when retraining the classifier on 16-shot data. We perform three training runs with different random seeds. Results show that the testing accuracy does not decrease with more epochs and has a small standard deviation across runs. We show the accuracy plots for other datasets in Appendix Fig. \ref{fig:no_overfit_more}.
  • Figure 5: Comparison of SWAT with state-of-the-art zero-shot and few-shot methods. We show that simply finetuning the whole visual encoder on few-shot data (our few-shot finetuning, green line) outperforms previous FSR methods, while finetuning on retrieved data (orange line) underperforms zero-shot methods (e.g., on ImageNet, EuroSAT, Food, DTD, and Stanford Cars) due to the large domain gap and imbalanced distributions of retrieved data. Our SWAT (red line) outperforms previous methods by >6% accuracy over nine datasets, with significant improvements (20-30%) on challenging datasets like Semi-Aves and Aircraft. The results validate the effectiveness of SWAT in mitigating the domain gap and imbalanced distribution issues. We also show that our SWAT+ (red dashed line), which finetunes both the visual encoder and the classifier on few-shot data in stage 2, improves further over SWAT (cf. Section \ref{sec:swat+}). Detailed performance on each dataset is provided in Table \ref{tab:compare_sota_detail}. For the Flowers, EuroSAT, DTD, and Stanford Cars datasets, we show that SWAT can be further improved by 1-6% accuracy with proper filtering of the retrieved data (cf. Table \ref{tab:dtd_cars}).
  • ...and 8 more figures
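
The class imbalance of retrieved data noted in the Figure 3 caption can be quantified in a simple way. The sketch below counts retrieved examples per downstream class and reports the ratio between the most and least frequent class, a common imbalance measure. The function name `imbalance_summary`, the class names, and the counts are all made up for illustration and do not come from the paper.

```python
from collections import Counter


def imbalance_summary(retrieved_labels, class_names):
    """Return per-class counts and the max/min count ratio.

    Classes with no retrieved examples get a count of 0, in which case the
    ratio is reported as infinity. Assumes class_names is non-empty.
    """
    counts = Counter(retrieved_labels)
    per_class = {name: counts.get(name, 0) for name in class_names}
    largest = max(per_class.values())
    smallest = min(per_class.values())
    ratio = float("inf") if smallest == 0 else largest / smallest
    return per_class, ratio


# Hypothetical retrieval result: many images for common classes, few for rare ones.
labels = ["sparrow"] * 120 + ["heron"] * 30 + ["kiwi"] * 3
per_class, ratio = imbalance_summary(labels, ["sparrow", "heron", "kiwi"])
```

A large ratio signals that a model trained only on retrieved data would see far more examples of some classes than others, which is one of the two issues the second stage of SWAT is meant to offset.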