Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger

TL;DR

MPVR removes the manual, potentially biased prompt-engineering step in zero-shot visual recognition with a two-stage meta-prompting workflow: it first elicits diverse task-specific LLM queries, then fills each query with a class name and queries the LLM again to obtain category-specific VLM prompts. The resulting prompts are ensembled into a zero-shot classifier that improves over CLIP by 5.0% (GPT) and 4.5% (Mixtral) on average across 20 diverse datasets, and by up to 19.8% and 18.2%, respectively. The approach is training-free and dataset-agnostic, and the authors open-source a dataset of roughly 2.5M descriptions to enable further research.

Abstract

Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.
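
To make the idea of prompt ensembling concrete, here is a minimal sketch of one common way to turn several category-specific prompts into a zero-shot classifier: average the normalized text embeddings of each class's prompts and pick the class whose averaged embedding is most similar to the image embedding. This is an illustration under assumptions, not the authors' code: the embeddings are hand-written toy vectors standing in for the outputs of a VLM's text and image encoders, and the helper names (`l2_normalize`, `class_prototype`, `classify`) are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_prototype(prompt_embeddings):
    """Ensemble one class's prompts: mean of normalized embeddings, renormalized."""
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def classify(image_embedding, prototypes):
    """Return the class with the highest cosine similarity, plus all scores."""
    image = l2_normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, proto))
        for name, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (assumption): two prompts per class, already embedded as 3-d vectors.
toy_prompt_embeddings = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "dog": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
prototypes = {
    name: class_prototype(embeddings)
    for name, embeddings in toy_prompt_embeddings.items()
}

# A toy image embedding that points mostly along the "cat" direction.
label, scores = classify([0.8, 0.3, 0.0], prototypes)
print(label, scores)
```

In a real setting the toy vectors would be replaced by encoder outputs, and each class would typically have many more prompts than two; the averaging step stays the same.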

Paper Structure

This paper contains 40 sections, 1 equation, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Our MPVR utilizes a Meta Prompt, comprising a system prompt (instruction), in-context example demonstrations (fixed throughout), and metadata (name and description) for a downstream task of interest. The Meta Prompt instructs an LLM to generate diverse task-specific LLM queries, which are used to obtain category-specific VLM prompts (visual text descriptions) by again querying the LLM after specifying the <class name>. These category-specific VLM prompts are then ensembled into a zero-shot classifier for recognizing the downstream task categories.
  • Figure 2: MPVR framework. In the first stage, a meta-prompt comprising a system prompt, in-context examples, and metadata specifying the downstream task is sent to the LLM, instructing it to generate multiple diverse task-specific LLM queries. In the second stage, these queries are populated with the category of interest and sent to the LLM again to obtain the category-level prompts used to assemble a zero-shot classifier.
  • Figure 3: Our meta-prompt comprises $3$ parts: A system prompt provides an overview of what is included in the overall prompt and what is expected from the LLM as a response (top). An in-context example consisting of metadata, dataset name, and hand-crafted prompts for the dataset (middle left). The downstream task metadata for which a diverse set of prompts are requested from the LLM (bottom left). For completeness, we also provide the in-context demonstrations (middle right) we use throughout, and the diverse LLM-generated queries for the example ImageNet-R dataset (bottom right).
  • Figure 4: Top-1 mean accuracy (%) for CLIP and standard deviation for $10$ random runs, for all datasets.
  • Figure 5: Top-1 mean accuracy (%) over DTD, EuroSat, Flowers, Resisc45, when subsampling the VLM prompt sets.
  • ...and 3 more figures
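
The two-stage flow described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the LLM is replaced by a deterministic stub (`fake_llm`), the function names and the wording of the system prompt, in-context example, and generated queries are all hypothetical, and only the `<class name>` placeholder is taken from the Figure 1 caption.

```python
from typing import Callable, Dict, List

PLACEHOLDER = "<class name>"  # placeholder named in the Figure 1 caption


def build_meta_prompt(system_prompt: str, in_context_example: str,
                      task_name: str, task_description: str) -> str:
    """Assemble the three parts: system prompt, in-context example, task metadata."""
    return "\n\n".join([
        f"SYSTEM: {system_prompt}",
        f"EXAMPLE: {in_context_example}",
        f"TASK: {task_name}. {task_description}",
    ])


def generate_vlm_prompts(llm: Callable[[str], str], meta_prompt: str,
                         class_names: List[str]) -> Dict[str, List[str]]:
    """Run both stages and return category-specific prompts per class."""
    # Stage 1: ask the LLM for task-specific queries containing the placeholder.
    query_templates = [
        line.strip() for line in llm(meta_prompt).splitlines()
        if PLACEHOLDER in line
    ]
    # Stage 2: fill in each class name and query the LLM again.
    prompts: Dict[str, List[str]] = {}
    for name in class_names:
        prompts[name] = []
        for template in query_templates:
            query = template.replace(PLACEHOLDER, name)
            prompts[name].extend(
                line.strip() for line in llm(query).splitlines() if line.strip()
            )
    return prompts


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call (assumption for this sketch)."""
    if prompt.startswith("SYSTEM:"):
        return (
            "Describe how a rendition of a <class name> looks.\n"
            "What visual features identify a <class name> in art?"
        )
    return f"Response to: {prompt}"


meta_prompt = build_meta_prompt(
    system_prompt="Generate diverse queries for describing classes of a dataset.",
    in_context_example="Dataset X with hand-crafted prompts (illustrative only).",
    task_name="ImageNet-R",
    task_description="Renditions of object categories.",
)
prompts = generate_vlm_prompts(fake_llm, meta_prompt, ["goldfish", "husky"])
print(prompts["goldfish"])
```

Swapping `fake_llm` for a function that calls an actual LLM would keep the control flow unchanged; the per-class prompt lists returned here are what would then be embedded and ensembled into the zero-shot classifier.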