ProAPO: Progressively Automatic Prompt Optimization for Visual Classification

Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong

TL;DR

This work tackles the prompt-quality bottleneck in vision-language models for image classification by proposing ProAPO, an evolution-based method that progressively refines prompts from task-specific templates to class-specific descriptions. It avoids repeated LLM querying by generating a reusable prompt library once and applying edit- and evolution-based operators to create diverse candidates, which are evaluated with a fitness function that combines accuracy with an entropy constraint to mitigate overfitting. Two sampling strategies further reduce iterations: prompt sampling to initialize high-quality descriptions and group sampling to focus on salient category groups. Across thirteen datasets and multiple backbones, ProAPO outperforms state-of-the-art textual-prompt methods in the one-shot setting and improves LLM-description approaches. Its optimized prompts also improve adapter-based methods and transfer across backbones, with no human in the loop.

Abstract

Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performance depends largely on prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination because LLMs hallucinate. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human in the loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike template optimization, the search space of class-specific candidate prompts explodes. This increases prompt generation costs, the number of iterations, and the risk of overfitting. To this end, we first introduce several simple yet effective edit-based and evolution-based operations that generate diverse candidate prompts from a one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce the number of traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. We also demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
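
To make the idea of a fitness score with an entropy constraint concrete, here is a minimal sketch. The abstract does not give the formula, so the form used here (accuracy minus `alpha` times the mean cross-entropy of the true class) and the names `fitness`, `softmax`, and `alpha` are assumptions for illustration, not the paper's definition. The point it shows is that an entropy-style term can rank candidates that tie on training accuracy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fitness(logits_per_image, labels, alpha=0.1):
    """Hypothetical score: accuracy minus alpha * mean cross-entropy.

    logits_per_image: per-class similarity scores for each training image.
    labels: ground-truth class index for each image.
    alpha: assumed weight of the entropy-style term (not from the source).
    """
    correct = 0
    cross_entropy = 0.0
    for logits, label in zip(logits_per_image, labels):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        correct += int(pred == label)
        cross_entropy += -math.log(probs[label])
    n = len(labels)
    return correct / n - alpha * cross_entropy / n


# Two candidate prompt sets that both classify the training images correctly,
# one with a wide margin and one with a narrow margin (toy numbers).
labels = [0, 1]
confident = [[4.0, 0.0], [0.0, 4.0]]
marginal = [[0.2, 0.0], [0.0, 0.2]]

print(fitness(confident, labels) > fitness(marginal, labels))
```

With `alpha=0` the two candidates are indistinguishable because both reach full training accuracy; the extra term prefers the candidate with the larger margin, which is one plausible way to break such ties in a one-shot setting.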

Paper Structure

This paper contains 45 sections, 5 equations, 15 figures, 21 tables, 6 algorithms.

Figures (15)

  • Figure 1: Issues of optimizing class-specific prompts. (a) Due to the hallucination in LLMs, generated descriptions may be inaccurate and lack discrimination between fine-grained categories (see red words). (b) Compared to task-specific templates, we see an explosion in the number of class-specific prompts (see red rectangle). This leads to higher generation costs, iteration times, and the overfitting problem. (c) Overfitting problem: Multiple candidate prompts have the same best training accuracy but variable and low test results (see red circle).
  • Figure 2: Overview of our ProAPO algorithm. We progressively refine prompts from task-specific (green lines) to class-specific (brown lines) levels. Specifically, we first explore the best template by an iterative optimization process (see the template optimization section). For each iteration, ProAPO generates a set of candidate templates by several operators (see the prompt generation section) and filters/refines templates by a fitness score (see the score function section). After several iterations, we choose the top-scoring template for description initialization. Subsequently, we introduce two sampling strategies to find a better initial point and reduce traversed categories (see the sampling strategy section). Similar iterative optimization is then applied to class-specific descriptions.
  • Figure 3: Results of adapter-based methods with different initial prompts. Solid and dotted lines denote prompt initialization with ProAPO and CuPL, respectively.
  • Figure 4: Results of prompt transfer to different backbones. The value denotes performance gains compared to vanilla VLMs. Our optimized prompts of ResNet50 and ViT-B/32 are reported.
  • Figure 5: Performance improvement analysis. (a) Analysis of the effect of single vs. ensemble prompts. * denotes results evaluated on the test set. ATO is our automatic template optimization algorithm. (b) Results of previous description-based methods with prompt optimization by our ATO and ProAPO algorithms.
  • ...and 10 more figures
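
The iterative loop described in the Figure 2 caption (generate candidates with operators, score them, keep the best, repeat) can be sketched as follows. This is a simplified greedy search under stated assumptions: the add/remove/replace edits, the function names `propose` and `optimize`, and the toy scorer are hypothetical, and evolution-based operators and the sampling strategies are omitted. In practice the scorer would be a fitness score computed on the few labeled training images.

```python
import random


def propose(current, library, rng):
    """Return edited copies of `current` (a non-empty tuple of prompts).

    The add/remove/replace edits are illustrative assumptions; the source
    only states that edit-based and evolution-based operators are used.
    """
    unused = [p for p in library if p not in current]
    candidates = []
    if unused:  # add one unused prompt from the library
        candidates.append(current + (rng.choice(unused),))
    if len(current) > 1:  # remove one prompt, never emptying the set
        drop = rng.randrange(len(current))
        candidates.append(current[:drop] + current[drop + 1:])
    if unused and current:  # replace one prompt with an unused one
        swap = rng.randrange(len(current))
        candidates.append(
            current[:swap] + (rng.choice(unused),) + current[swap + 1:]
        )
    return candidates


def optimize(initial, library, score, iterations=20, seed=0):
    """Greedy iterative search that keeps the best-scoring prompt set."""
    rng = random.Random(seed)
    best = tuple(initial)
    best_score = score(best)
    for _ in range(iterations):
        for cand in propose(best, library, rng):
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score


# Toy prompt library, standing in for the one-time LLM-generated library.
library = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sketch of a {}.",
    "a close-up photo of a {}.",
]
helpful = {"a photo of a {}.", "a close-up photo of a {}."}


def toy_score(prompts):
    """Stand-in scorer: rewards 'helpful' prompts, penalizes the rest."""
    return sum(1 if p in helpful else -1 for p in prompts) / len(library)


start = ("a blurry photo of a {}.",)
best, best_score = optimize(start, library, toy_score)
print(best, best_score)
```

Because candidates are drawn from a fixed library, the loop never needs to query an LLM again after the library is built, which mirrors the one-time-query design the abstract describes.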