Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

Chen Huang, Skyler Seto, Samira Abnar, David Grangier, Navdeep Jaitly, Josh Susskind

TL;DR

This paper improves prompt learning by distilling textual knowledge from natural language prompts into rich priors for under-represented concepts. The resulting prompt embedding is called the Aggregate-and-Adapted Prompt Embedding (AAPE).

Abstract

Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks even when only limited annotated data is available. In this paper, we improve prompt learning by distilling the textual knowledge from natural language prompts (either human- or LLM-generated) to provide rich priors for those under-represented concepts. We first obtain a prompt "summary" aligned to each input image via a learned prompt aggregator. We then jointly train a prompt generator, optimized to produce a prompt embedding that stays close to the aggregated summary while also minimizing the task loss. We call this prompt embedding the Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE generalizes to different downstream data distributions and tasks, including vision-language understanding tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning), where it achieves competitive performance. We also show that AAPE is particularly helpful for handling non-canonical and OOD examples. Furthermore, AAPE learning eliminates the LLM-based inference cost required by baselines, and scales better with data and LLM model size.
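
The two-part objective described above (stay close to the aggregated summary while minimizing the task loss) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the squared-L2 distance, the weight `lam`, the temperature value, and all function names are assumptions introduced here, and the class text embeddings are assumed to be already conditioned on the AAPE.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def task_loss(image_emb, class_text_embs, label, temperature=0.07):
    """CLIP-style classification loss: cross-entropy over the
    temperature-scaled cosine similarities between one image embedding
    and one text embedding per class (temperature is an assumed value)."""
    img = l2_normalize(image_emb)
    logits = [dot(img, l2_normalize(t)) / temperature for t in class_text_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def distill_loss(aape, summary):
    """Distance between the generated prompt embedding and the aggregated
    summary. Squared L2 is an assumption made for this sketch."""
    return sum((a - s) ** 2 for a, s in zip(aape, summary))


def total_loss(image_emb, class_text_embs, label, aape, summary, lam=1.0):
    """Task loss plus a weighted distillation term (lam is hypothetical)."""
    return task_loss(image_emb, class_text_embs, label) + lam * distill_loss(aape, summary)
```

Setting `lam` to zero recovers plain prompt learning on the task loss alone, which is one way to isolate the contribution of the distilled textual knowledge.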

Paper Structure

This paper contains 41 sections, 4 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Aggregate-and-adapt the textual knowledge in natural language prompts for downstream tasks. (a) For classification of object-centric images, we query GPT-3 to obtain a list of prompts for each class, e.g., the car model of "Jeep Compass SUV 2012". Note how redundant the reference prompts can be (e.g., the first two), and how they can be irrelevant to the image (e.g., the last prompt). Alternatively, for complex tasks like VQA, we use human-generated image captions to depict multi-object images. For all tasks, we first learn to aggregate the reference prompts into an image-aligned "summary" (prompt embedding) based on CLIP reward. Then a prompt generator is jointly trained to generate the Aggregate-and-Adapted Prompt Embedding (AAPE), such that both the distance between AAPE and the aggregated summary and the task loss are minimized, the latter for adaptation. (b) At test time, we keep only the prompt generator and discard the prompt aggregator. Our AAPE is applicable to different vision-language tasks with strong generalization performance.
  • Figure 2: LLM-generated image prompts for ImageNet categories, and the hand-constructed image captions on the COCO and Flickr30k datasets. Note that ImageNet mainly contains object-centric images with relatively clean backgrounds, and the LLM-generated image prompts can describe distinct characteristics of the given classes. In contrast, COCO and Flickr30k contain multi-object images with cluttered backgrounds, and the hand-constructed captions can represent varying object relations.
  • Figure 3: (a) Input-adapted prompt aggregator which aggregates the embeddings of reference prompts ${\bm{P}}$ into an image-aligned, condensed prompt embedding ${\bm{p}}^a$ based on CLIP reward. (b) Instantiation of our prompt learning approach for image classification. The CLIP model is kept frozen.
  • Figure 4: Quantifying the role of LLM knowledge (distilled with $\mathcal{L}_{\text{distill}}$) in prompt learning. $\mathcal{L}_{\text{distill}}$ consistently improves the base and new class accuracies on 11 classification datasets.
  • Figure 5: AAPE helps disambiguate the classification task. To highlight the textual knowledge encoded in AAPE, we show some reference prompts generated by GPT-3. For both the prompt template and AAPE (before concatenation and projection), we measure their cosine similarity score with the image. Note that the similarity score can be small when using a basic prompt template to match the "altar" class instance on ImageNet. Indeed, in this non-canonical image view, the altar is small and the whole scene can be classified as the easily confused class of "church". AAPE, by contrast, is able to eliminate the confusion by providing additional cues, such as that an altar "is a raised table" often located in a "church". This results in increased image-text similarity. Similarly, the textual cues from AAPE are helpful for the OOD examples in the special domains of DTD and EuroSAT.
  • ...and 4 more figures
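
The image-aligned aggregation in Figure 3(a) and the cosine similarity score used in Figure 5 can be illustrated with a small sketch. It is only an approximation under stated assumptions: the paper's aggregator is learned with a CLIP reward, whereas the hypothetical `aggregate_prompts` below uses fixed softmax weights over image-prompt cosine similarities, and the temperature value is likewise assumed.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def aggregate_prompts(image_emb, prompt_embs, temperature=0.07):
    """Weight each reference prompt embedding by its similarity to the
    image and return the weighted sum (the "summary") with the weights.
    A fixed, non-learned stand-in for the learned aggregator."""
    unit_prompts = [l2_normalize(p) for p in prompt_embs]
    scaled = [cosine(image_emb, p) / temperature for p in unit_prompts]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(unit_prompts[0])
    summary = [
        sum(w * p[d] for w, p in zip(weights, unit_prompts))
        for d in range(dim)
    ]
    return summary, weights
```

Under this weighting, prompts that are irrelevant to the image receive little weight, so the summary tends to be more similar to the image than a uniform average of the reference prompts would be.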