Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Zhaoheng Zheng, Jingmin Wei, Xuefeng Hu, Haidong Zhu, Ram Nevatia

TL;DR

This work addresses low-shot image classification by bridging language and vision through LLaMP, which uses Large Language Models as adaptive prompt learners to condition the CLIP text encoder. By constructing a knowledge cache in the LLM and learning class-specific prompts that are integrated with CLIP via a lightweight, parameter-efficient update scheme, LLaMP improves zero-shot base-to-novel generalization and 16-shot accuracy across 11 datasets, outperforming PromptSRC (PSRC) and vanilla CLIP. The main contributions include a two-stage LLM prompt mechanism with a knowledge cache, a three-term training objective, and data-efficient vision-language tuning that yields gains on fine-grained and diverse domains. The framework demonstrates the practical impact of leveraging LLM encyclopedic knowledge for improved VL generalization, with potential for broader adoption in low-data scenarios, while suggesting avenues to incorporate language priors earlier in the vision pipeline.

Abstract

Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets. Code will be made available at: https://github.com/zhaohengz/LLaMP.

Paper Structure

This paper contains 12 sections, 12 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Demonstration of LLaMP: (a) LLMs can provide visual descriptions for fine-grained object categories; (b) Zero-shot base-to-novel generalization benefits from the LLM knowledge.
  • Figure 2: An Overview of the LLaMP Framework: We first generate the knowledge cache by passing the query prompt through the LLM $\mathcal{D}$, then use the knowledge cache to encode $\bm{p}_l$, resulting in the adaptive prompts $\tilde{\bm{h}}^i_l = W\bm{h}^i_l + b^i$ for the CLIP text encoder. $\tilde{\bm{h}}_l$ is combined with the regular learnable prompts of $\mathcal{G}$ to generate the final text feature vector $\bm{g_p}$. The image feature vector $\bm{f_p}$ is obtained through a hybrid-tuning strategy that combines prompt learning and low-rank adaptation (LoRA).
  • Figure 3: Effect of LLM Prompts on Harmonic Mean. 16 prompts achieve the most balanced performance.
  • Figure 4: Visualization of LLaMP Predictions by Grad-CAM (Selvaraju et al., 2017)
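
The projection in the Figure 2 caption, $\tilde{\bm{h}}^i_l = W\bm{h}^i_l + b^i$, can be illustrated with a small sketch. This is not the authors' implementation: it assumes, from the caption's notation alone, that one matrix $W$ is shared across prompts while each prompt $i$ has its own bias $b^i$. The function name `project_prompts` and the toy dimensions (LLM width 3, CLIP width 2) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from typing import List

Matrix = List[List[float]]


def project_prompts(h: Matrix, W: Matrix, b: Matrix) -> Matrix:
    """Compute h_tilde[i] = W @ h[i] + b[i] for every prompt i.

    h: one LLM hidden-state vector per prompt (n_prompts x llm_width)
    W: projection shared by all prompts (clip_width x llm_width), an assumption
    b: one bias vector per prompt (n_prompts x clip_width), an assumption
    """
    if len(h) != len(b):
        raise ValueError("expected one bias vector per prompt")
    projected = []
    for h_i, b_i in zip(h, b):
        if len(b_i) != len(W):
            raise ValueError("bias width must match the number of rows in W")
        row = []
        for w_row, bias in zip(W, b_i):
            if len(w_row) != len(h_i):
                raise ValueError("W column count must match the hidden-state width")
            # Dot product of one row of W with the hidden state, plus the bias.
            row.append(sum(w * x for w, x in zip(w_row, h_i)) + bias)
        projected.append(row)
    return projected


# Toy example: 2 prompts, LLM width 3, CLIP width 2 (all values hypothetical).
h = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b = [[0.5, 0.0], [0.0, -1.0]]
h_tilde = project_prompts(h, W, b)
print(h_tilde)
```

In practice the LLM hidden states are much wider than the CLIP text encoder's token embeddings, so a projection of this kind is what brings the LLM outputs to a width the CLIP text encoder can consume as prompt tokens.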