Making Better Mistakes in CLIP-Based Zero-Shot Classification with Hierarchy-Aware Language Prompts

Tong Liang, Jim Davis

TL;DR

This work tackles zero-shot image classification with CLIP by injecting the label hierarchy into language prompts. It introduces HAPrompts, a training-free approach that uses two hierarchy-aware prompt families (comparative prompts and path-based prompts) to query an LLM for image prompts describing each leaf class. The embeddings of those prompts form the class classifiers $T_k$, which are matched against the image embedding $I_x$ via $\hat{y}_x = \arg\max_k I_x \cdot T_k$. By aggregating the embeddings of all generated prompts, HAPrompts better captures hierarchical semantic relationships and reduces mistake severity, measured by hierarchical distance, outperforming several baselines across five datasets. The method is robust to the choice of LLM (e.g., ChatGPT, Claude, Gemini) and maintains strong performance with an embedding-space ensemble. Overall, the paper provides a practical, dataset-agnostic, training-free strategy for making better mistakes in CLIP-based zero-shot classification, with code and prompts available on GitHub.
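The decision rule above can be illustrated with a minimal sketch. It assumes that "aggregating" means averaging the L2-normalized prompt embeddings of each class, which the summary does not state, and it uses toy 2-D vectors in place of real CLIP encoder outputs. The helper names `build_classifiers` and `predict` are hypothetical and are not taken from the HAPrompts repository.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def build_classifiers(prompt_embeddings):
    """Average each class's prompt embeddings into one classifier T_k.

    prompt_embeddings: list with one array per class, each of shape (n_prompts_k, d).
    Returns an array of shape (K, d) whose rows have unit length.
    Mean aggregation is an assumption made for this sketch.
    """
    return np.stack([
        l2_normalize(l2_normalize(e).mean(axis=0)) for e in prompt_embeddings
    ])

def predict(image_embedding, classifiers):
    """Return argmax_k I_x . T_k for a single image embedding I_x."""
    scores = classifiers @ l2_normalize(image_embedding)
    return int(np.argmax(scores))

# Toy 2-D embeddings standing in for CLIP text and image encoder outputs.
toy_prompts = [
    np.array([[1.0, 0.0], [0.9, 0.1]]),  # prompts for class 0
    np.array([[0.0, 1.0], [0.1, 0.9]]),  # prompts for class 1
]
T = build_classifiers(toy_prompts)
pred = predict(np.array([0.2, 0.8]), T)
print(pred)
```

In practice the rows of each per-class array would be the CLIP text embeddings of that class's LLM-generated prompts, and `image_embedding` would come from the CLIP image encoder.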

Abstract

Recent studies leverage large language models (LLMs) trained on extensive internet-crawled text to generate textual descriptions of downstream classes for CLIP-based zero-shot image classification. While most of these approaches aim to improve accuracy, our work focuses on "making better mistakes", where the severity of a mistake is derived from the given label hierarchy of the downstream task. Since CLIP's image encoder is trained with language supervision, it implicitly captures the hierarchical semantic relationships between classes. This motivates our goal of making better mistakes in zero-shot classification, a task for which CLIP is naturally well-suited. Our approach (HAPrompts) queries the language model to produce textual representations of the given classes, which serve as CLIP's zero-shot classifiers for image classification on downstream tasks. To our knowledge, this is the first work to introduce making better mistakes in CLIP-based zero-shot classification. In our experiments, our approach outperforms related methods in a holistic comparison across five datasets of varying scales, with label hierarchies of different heights. Our code and LLM-generated image prompts: https://github.com/ltong1130ztr/HAPrompts.
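To make "severity derived from the label hierarchy" concrete, here is a small sketch under one common convention: the severity of predicting one leaf when another is correct is the height of their lowest common ancestor in the tree. Whether the paper uses exactly this definition is an assumption; the toy hierarchy and the function names are hypothetical.

```python
def ancestors(node, parent):
    """Return the path from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def node_heights(parent):
    """Height of each node: the longest number of edges down to a leaf."""
    leaves = set(parent) - set(parent.values())
    height = {}
    for leaf in leaves:
        for steps, node in enumerate(ancestors(leaf, parent)):
            height[node] = max(height.get(node, 0), steps)
    return height

def mistake_severity(true_label, predicted_label, parent, height):
    """Height of the lowest common ancestor (0 for a correct prediction)."""
    true_path = set(ancestors(true_label, parent))
    # Walking up from the prediction, the first shared node is the LCA.
    for node in ancestors(predicted_label, parent):
        if node in true_path:
            return height[node]
    raise ValueError("labels do not share a root")

# Toy child -> parent hierarchy (illustrative only, not from the paper).
parent = {
    "tabby": "cat", "siamese": "cat", "cat": "animal",
    "beagle": "dog", "dog": "animal",
}
height = node_heights(parent)
print(mistake_severity("tabby", "siamese", parent, height))  # sibling confusion
print(mistake_severity("tabby", "beagle", parent, height))   # cross-branch confusion
```

Under this convention, confusing two cat breeds is a milder mistake than confusing a cat with a dog, which is the kind of distinction top-1 accuracy alone does not capture.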

Paper Structure

This paper contains 20 sections, 2 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Overall process of the proposed CLIP-based zero-shot classification. We introduce prior knowledge of the label hierarchy to the language prompts used to query the LLM.
  • Figure 2: Histogram of mistake severities for predictions of CLIP (zero-shot) and ViTs (trained on ImageNet from scratch) on ImageNet. The mistake severities are derived from our label hierarchy of ImageNet. ViTs trained from scratch tend to make more high-severity mistakes (severity $\ge$ 7 on ImageNet). The ViT models use PyTorch pretrained weights.
  • Figure 3: Example of a subtree of the ImageNet hierarchy employed in our approach.
  • Figure 4: Holistic comparison of all methods. We project the evaluation results of each metric/dataset pair onto a specific polar axis of the radar chart. We convert Top-1 accuracy to error rate so that smaller is better for every metric; smaller values are plotted farther from the origin, so larger polygons indicate better overall performance. 'IN' represents ImageNet.
  • Figure 5: Holistic comparison of ablation study results.
  • ...and 8 more figures