LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

Songhao Han, Le Zhuo, Yue Liao, Si Liu

TL;DR

The paper tackles ambiguity in LLM-generated class descriptors for vision-language models by proposing a training-free Iterative Optimization with Visual Feedback, where an LLM-guided agent uses a genetic-algorithm-like process to evolve descriptors based on visual feedback from a VLM. By grounding language in CLIP-derived metrics and employing memory banks, the approach dynamically discovers descriptors that maximize image–text alignment, outperforming prior zero-shot and LLM-based methods across nine datasets. The method demonstrates strong generalization across backbones and maintains interpretability of the descriptors, while remaining compatible with fine-tuning pipelines. Overall, the work advances robust, interpretable, and transferable prompt design for visual classification through closed-loop language–vision interaction.

Abstract

Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While previous studies have leveraged recent advancements in large language models (LLMs) to enhance these descriptors, their outputs often suffer from ambiguity and inaccuracy. We attribute this to two primary factors: 1) the reliance on single-turn textual interactions with LLMs, leading to a mismatch between generated text and visual concepts for VLMs; 2) the oversight of inter-class relationships, resulting in descriptors that fail to differentiate similar classes effectively. In this paper, we propose a novel framework that integrates LLMs and VLMs to find the optimal class descriptors. Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors. We demonstrate that our optimized descriptors are of high quality and effectively improve classification accuracy on a wide range of benchmarks. Additionally, these descriptors offer explainable and robust features, boosting performance across various backbone models and complementing fine-tuning-based methods.
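
To make the classification step concrete, the following is a minimal sketch of descriptor-based zero-shot classification. It is not the authors' implementation. It assumes that image and descriptor embeddings have already been produced by a VLM such as CLIP, and that a class is scored by the mean cosine similarity between the image and that class's descriptors (one common choice, used here as an assumption). The names `cosine`, `class_score`, `classify`, and all toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_score(image_emb, descriptor_embs):
    # Assumption: the class score is the mean similarity over its descriptors.
    sims = [cosine(image_emb, d) for d in descriptor_embs]
    return sum(sims) / len(sims)

def classify(image_emb, class_descriptor_embs):
    """Return the class whose descriptors are most similar to the image."""
    return max(
        class_descriptor_embs,
        key=lambda name: class_score(image_emb, class_descriptor_embs[name]),
    )

# Toy 3-d embeddings standing in for real VLM outputs (hypothetical values).
toy_classes = {
    "forest": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "river": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]],
}
pred_a = classify([0.8, 0.1, 0.0], toy_classes)
pred_b = classify([0.1, 0.7, 0.1], toy_classes)
```

In this toy setup, changing a class's descriptors changes its embeddings and therefore which images it wins, which is the lever the descriptor optimization relies on.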

Paper Structure

This paper contains 19 sections, 3 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Schematic of the method. (a) Previous methods use an LLM to generate descriptive prompts for each class directly. (b) Our method optimizes class descriptions through an evolutionary process. We utilize a VLM (such as CLIP) to obtain visual feedback, e.g., the confusion matrix, assessing the quality of current descriptions. Upon building the visual feedback, an LLM generates refined category descriptions, iterating multiple times to achieve the final optimal category descriptions.
  • Figure 2: Illustration of iterative optimization with visual feedback. (a) Given raw class names as input, we first prompt the LLM to generate an initialization of class descriptors. These descriptors undergo an iterative optimization comprising three stages: mutation, where diverse new candidates are generated; crossover, involving mixing and matching across different candidates to produce better candidates; and natural selection, selecting the most suitable candidate based on a fitness function. (b) In each iteration, we compute visual metrics including classification accuracy and confusion matrix for current class descriptors. We further use these metrics to construct visual feedback, update memory banks, and pick the best candidate in natural selection. (c) Through this iterative optimization, the LLM progressively identifies the most effective class descriptors, thereby enhancing the differentiation between ambiguous classes.
  • Figure 3: Ablation on iterative optimizations. X-axis: iteration rounds, Y-axis: Accuracy (%). Red stars represent the accuracy of vanilla CLIP.
  • Figure 4: Examples of interpretability. We select two categories from EuroSAT and Flowers102 and list the top-$3$ and bottom-$3$ descriptions for each category, ranked by their similarity scores averaged over each class's test samples.
  • Figure 5: Iterative Optimization Visualization. From left to right: rounds $0$, $4$, and $9$ of our method, followed by the CuPL method. The darker the color in the heatmap, the higher the corresponding value in the confusion matrix.
  • ...and 1 more figure
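
The mutation, crossover, and natural selection stages described in the Figure 2 caption can be sketched as a small loop. This is an illustrative toy under stated assumptions, not the paper's algorithm: the LLM calls are replaced by random string edits, the VLM-derived fitness is replaced by a hypothetical keyword score, and the memory banks are omitted. All names (`fitness`, `mutate`, `crossover`, `evolve`, `VISUAL_WORDS`) are invented for this example.

```python
import random

# Hypothetical stand-in for visual feedback: words we pretend the VLM rewards.
VISUAL_WORDS = {"green", "dense", "winding", "blue"}

def fitness(candidate):
    """Toy fitness: fraction of descriptors containing a rewarded word.
    In the setting described above, this would come from VLM metrics
    such as classification accuracy."""
    hits = sum(any(w in VISUAL_WORDS for w in d.split()) for d in candidate)
    return hits / len(candidate)

def mutate(candidate, vocabulary, rng):
    """Stand-in for an LLM mutation: rewrite one descriptor's modifier."""
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] = f"{rng.choice(vocabulary)} {child[i].split()[-1]}"
    return child

def crossover(a, b, rng):
    """Mix two candidates by picking each descriptor from either parent."""
    return [rng.choice(pair) for pair in zip(a, b)]

def evolve(initial, vocabulary, rounds=10, n_mutants=4, seed=0):
    rng = random.Random(seed)
    best = list(initial)
    history = [fitness(best)]
    for _ in range(rounds):
        mutants = [mutate(best, vocabulary, rng) for _ in range(n_mutants)]
        children = [
            crossover(rng.choice(mutants), rng.choice(mutants), rng)
            for _ in range(n_mutants)
        ]
        # Natural selection: keep the fittest candidate. The current best
        # stays in the pool, so fitness never decreases between rounds.
        best = max([best] + mutants + children, key=fitness)
        history.append(fitness(best))
    return best, history

initial = ["a forest", "some trees", "an area"]
vocabulary = ["green", "dense", "large", "nice"]
best, history = evolve(initial, vocabulary)
```

Keeping the current best in the selection pool is a design choice made for this sketch so that the recorded fitness is non-decreasing; the source does not state whether the authors' selection step works this way.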