Learning to Prompt with Text Only Supervision for Vision-Language Models

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari

TL;DR

This paper introduces ProText, a text-only prompt-learning framework that enhances CLIP generalization by extracting rich contextual knowledge from LLM-generated text without using images during training. A contextual mapping loss embeds LLM-derived descriptions into learnable text prompts, enabling zero-shot transfer to unseen classes and datasets and reducing prompt-engineering costs. Through extensive experiments on four benchmarks, ProText consistently surpasses or matches image-supervised and prompt-ensembling baselines in cross-dataset transfer, base-to-novel generalization, and domain robustness. The approach demonstrates the viability of text-only supervision for scalable, transferable vision-language adaptation with practical implications for low-data regimes and costly labeling scenarios.

Abstract

Foundational vision-language models such as CLIP are becoming a new paradigm in vision due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and often struggle to generalize to new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods that generate class descriptions from large language models (LLMs) and perform prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, which incurs higher costs because LLM descriptions must be generated for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial in the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with LLM contextual data mapped within the learned prompts, our approach enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks, where our method improves over prior ensembling works while being competitive with those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText.
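
The training-free prompt-ensembling baseline described in the abstract can be illustrated with a minimal sketch. It assumes the common recipe of averaging L2-normalized text features per class and classifying by cosine similarity; the 3-dimensional vectors and the function names are hypothetical stand-ins for CLIP encoder outputs, not the authors' code.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ensemble_class_feature(description_features):
    """Average the normalized features of one class's descriptions, then re-normalize."""
    normalized = [l2_normalize(f) for f in description_features]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def classify(image_feature, class_features):
    """Return the class whose ensembled text feature is most similar to the image feature."""
    image = l2_normalize(image_feature)
    scores = {
        name: sum(a * b for a, b in zip(image, feature))
        for name, feature in class_features.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy features: two LLM descriptions per class.
toy_descriptions = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "dog": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
class_features = {
    name: ensemble_class_feature(features)
    for name, features in toy_descriptions.items()
}
prediction = classify([0.8, 0.2, 0.0], class_features)
```

In this sketch every entry of `class_features` is built from descriptions written for that one class, so a new class needs its own LLM descriptions before it can be scored. This is the per-class cost the abstract refers to.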

Paper Structure

This paper contains 25 sections, 4 equations, 11 figures, and 12 tables.

Figures (11)

  • Figure 1: Without using any images for supervision, ProText with text-only training improves over CLIP, CuPL, and prior 16-shot image-supervised methods in challenging cross-dataset transfer settings. The prompt-ensembling-based CuPL performs the same as CLIP because it cannot transfer class-specific LLM templates across datasets.
  • Figure 2: Overview of the ProText framework. (Left) First, diverse captions are generated for the training classes using an LLM such as GPT-3. During training, the CLIP text encoders generate a prompted class-name feature ($\bm{\Tilde{g}_p}$) from class-name templates with learnable prompts and a frozen LLM template feature ($\bm{\Tilde{g}}$) from LLM-generated templates. Next, a contextual mapping loss guides the learnable prompts to learn a mapping from the prompted class-name feature to the LLM template feature, which contains more information about the class. This allows the learned prompts to exploit the internal knowledge of the text encoder, complemented by LLM descriptions. (Right) At inference, the learned prompts are used with class-name templates, and the standard zero-shot CLIP inference protocol is followed. Moreover, the rich contextual information from LLM descriptions mapped within the learned prompts enables their transfer to new classes and datasets.
  • Figure 3: With the same amount of text data, learning contextual prompts with text-only supervision improves CLIP performance compared with prompt ensembling techniques.
  • Figure 4: Cross-dataset transfer setting. CuPL and CLIP perform the same on cross-datasets because CuPL source data cannot transfer across datasets. Image-based models are trained on 16-shot ImageNet samples. ProText employs the same ImageNet data as CuPL for prompt learning.
  • Figure 5: ProText results with text supervision on each dataset. We compare ProText with CLIP and CuPL. Gains of ProText over CuPL are shown in blue.
  • ...and 6 more figures
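
The training step described in the Figure 2 caption can be sketched as a toy example. Two assumptions are made here that the text above does not confirm: the contextual mapping loss is taken to be a mean squared error between the prompted class-name feature and the LLM template feature, and the learnable prompts are replaced by a single offset vector shared across classes (a hypothetical simplification; in the caption the prompts pass through the CLIP text encoder). All names and values are illustrative.

```python
def mapping_loss(prompted_feature, llm_feature):
    """Assumed form of the loss: mean squared error between the two features."""
    dim = len(prompted_feature)
    return sum((p - t) ** 2 for p, t in zip(prompted_feature, llm_feature)) / dim


def apply_offset(class_name_feature, offset):
    """Toy stand-in for prompting: add the shared learnable offset."""
    return [g + o for g, o in zip(class_name_feature, offset)]


def train_shared_offset(class_name_features, llm_features, lr=0.1, steps=200):
    """Gradient descent on the mean mapping loss over all training classes."""
    dim = len(class_name_features[0])
    count = len(class_name_features)
    offset = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for g, t in zip(class_name_features, llm_features):
            for i in range(dim):
                # Derivative of the mean of per-class MSE losses w.r.t. offset[i].
                grad[i] += 2.0 * (g[i] + offset[i] - t[i]) / (dim * count)
        offset = [o - lr * gr for o, gr in zip(offset, grad)]
    return offset


# Hypothetical frozen features: class-name templates and richer LLM templates.
class_name_features = [[1.0, 0.0], [0.0, 1.0]]
llm_features = [[1.2, 0.3], [0.2, 1.3]]

loss_before = mapping_loss(class_name_features[0], llm_features[0])
offset = train_shared_offset(class_name_features, llm_features)
loss_after = mapping_loss(apply_offset(class_name_features[0], offset), llm_features[0])

# The same offset can be applied to a class that was never seen during training.
unseen_prompted = apply_offset([0.5, 0.5], offset)
```

Because the learned parameters are shared across classes rather than stored per class, they can be reused for an unseen class name without generating new LLM descriptions, which mirrors the transfer property the caption describes.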