Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, Peng Gao

TL;DR

CaFo presents a novel cascade of foundation models that jointly leverage CLIP, DINO, DALL-E, and GPT-3 to advance few-shot vision. The core idea is a Prompt, Generate, then Cache pipeline where GPT-3 crafts richer CLIP prompts, DALL-E expands the training set with synthetic images, and a two-key cache blends CLIP and DINO through adaptive weighting. Empirical results on 11 datasets show state-of-the-art performance across 1, 4, 8, and 16-shot regimes and strong zero-shot and distribution-shift robustness. The approach demonstrates that integrating diverse pre-training paradigms via adaptive inference yields substantial gains in data-efficient visual recognition, with promising avenues for incorporating more foundation models in the future.

Abstract

Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training. We then ask whether more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning. CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. First, we leverage GPT-3 to produce textual inputs for prompting CLIP with rich downstream linguistic semantics. Then, we generate synthetic images via DALL-E to expand the few-shot training data without any manual effort. Finally, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CaFo can fully unleash the potential of different pre-training methods and unify them to achieve state-of-the-art few-shot classification performance. Code is available at https://github.com/ZrrSkywalker/CaFo.
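
The "Cache" step can be illustrated with a small key-value sketch. The abstract only states that a learnable cache model blends CLIP and DINO predictions, so everything specific below is an assumption made for illustration: the helper names `build_cache` and `cache_logits`, the exponential affinity (borrowed from Tip-Adapter-style caches), the value `beta=5.5`, and the toy features that stand in for real encoder outputs. Training of the cache keys is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_cache(features, labels, num_classes):
    """Hypothetical helper: keys are normalized features, values are one-hot labels."""
    keys = l2_normalize(np.asarray(features, dtype=np.float64))   # (N, D)
    values = np.eye(num_classes)[np.asarray(labels)]              # (N, C)
    return keys, values

def cache_logits(query, keys, values, beta=5.5):
    """Affinity-weighted vote over cached labels (assumed Tip-Adapter-style form)."""
    q = l2_normalize(np.asarray(query, dtype=np.float64))         # (B, D)
    cos = q @ keys.T                                              # (B, N)
    affinity = np.exp(-beta * (1.0 - cos))                        # 1.0 for identical vectors
    return affinity @ values                                      # (B, C)

num_classes, shots = 3, 4
labels = np.repeat(np.arange(num_classes), shots)

# Toy stand-ins for encoder outputs: each class points along its own axis.
# Real CLIP and DINO features would come from the pre-trained encoders.
clip_feats = np.eye(8)[labels]     # pretend CLIP embedding size is 8
dino_feats = np.eye(16)[labels]    # pretend DINO embedding size is 16

# Two sets of keys share one set of one-hot values.
clip_keys, values = build_cache(clip_feats, labels, num_classes)
dino_keys, _ = build_cache(dino_feats, labels, num_classes)

# Query with a sample of class 0; both caches should vote for class 0.
clip_out = cache_logits(clip_feats[:1], clip_keys, values)
dino_out = cache_logits(dino_feats[:1], dino_keys, values)
print(clip_out.argmax(axis=-1), dino_out.argmax(axis=-1))
```

In this sketch the two caches differ only in their keys, which is one plausible reading of a cache that stores CLIP and DINO knowledge side by side; synthetic DALL-E images would simply add more rows to `features` and `labels`.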

Paper Structure (43 sections, 6 equations, 14 figures, 10 tables)

Figures (14)

  • Figure 1: The Cascade Paradigm of CaFo. We adaptively incorporate the knowledge from four types of pre-training methods and achieve a strong few-shot learner.
  • Figure 2: Prompt with GPT-3 [brown2020language]. As the first step in CaFo, we utilize the pre-trained GPT-3 to produce prompts with rich linguistic semantics for CLIP's textual encoder.
  • Figure 3: Generate via DALL-E [pmlr-v139-ramesh21a], then Cache by CLIP [radford2021learning] and DINO [Caron_2021_ICCV]. We adopt DALL-E to generate synthetic images to expand the limited few-shot training samples. Then, we construct the cache model with two kinds of keys to adaptively fuse the knowledge from CLIP and DINO.
  • Figure 4: Adaptive Inference with Cache Model. We regard the test image as a query and retrieve CLIP's and DINO's knowledge from the corresponding two keys in the cache model. Then, we calculate the distribution similarities between the different classification logits for an adaptive ensemble.
  • Figure 5: Performance (%) Comparison on ImageNet. We compare CaFo with other methods for different few-shot settings.
  • ...and 9 more figures
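
The adaptive ensemble described in the Figure 4 caption can be sketched as follows. The caption says only that distribution similarities between classification logits drive the ensemble. The rest is assumed here for illustration: the function name `adaptive_ensemble`, the use of zero-shot CLIP logits as the reference, the inner product of softmax distributions as the similarity, and the additive fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_ensemble(zero_shot, clip_cache, dino_cache):
    """Weight each cache prediction by its agreement with a reference prediction.

    Assumption: the reference is the zero-shot CLIP logits, and agreement is the
    inner product of the two softmax distributions. Inputs have shape (B, C).
    """
    p_ref = softmax(zero_shot)
    sim_clip = (p_ref * softmax(clip_cache)).sum(axis=-1)   # (B,)
    sim_dino = (p_ref * softmax(dino_cache)).sum(axis=-1)   # (B,)
    # Normalize the two similarities into per-sample ensemble weights.
    weights = softmax(np.stack([sim_clip, sim_dino], axis=-1))  # (B, 2)
    fused = zero_shot + weights[:, :1] * clip_cache + weights[:, 1:] * dino_cache
    return fused, weights

# Toy logits for one test image and three classes (made-up numbers).
zero_shot = np.array([[2.0, 0.5, 0.1]])
clip_cache = np.array([[1.5, 0.2, 0.3]])   # agrees with the reference
dino_cache = np.array([[0.1, 0.2, 1.8]])   # disagrees with the reference

fused, weights = adaptive_ensemble(zero_shot, clip_cache, dino_cache)
print(weights)                 # the agreeing source receives the larger weight
print(fused.argmax(axis=-1))   # predicted class index
```

The design choice illustrated is that the source whose prediction is closer to the reference contributes more to the final logits, so a cache that is unreliable for a given image is down-weighted for that image only.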