Self-Evolving Visual Concept Library using Vision-Language Critics

Atharva Sehgal, Patrick Yuan, Ziniu Hu, Yisong Yue, Jennifer J. Sun, Swarat Chaudhuri

TL;DR

Escher is a self-evolving framework that builds a visual concept library by feeding the judgments of a vision-language critic back into an LLM-driven concept generator. It alternates between concept-bottleneck optimization and history-conditioned concept evolution, so that open-vocabulary concepts are refined iteratively without human labels. Across seven datasets and multiple concept-bottleneck models (CBMs), Escher consistently improves zero-shot, few-shot, and fine-tuned classification by combining a history-enabled prompting strategy with a VLM-based feedback signal. The approach is robust to backbone variations and offers a plug-and-play way to learn interpretable, discriminative visual concepts.

Abstract

We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic's feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.

Paper Structure

This paper contains 56 sections, 3 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: An overview of Escher. Prior work: concept-bottleneck visual recognition aims to leverage discriminative visual concepts to enable more accurate object classification. Ours: Escher is an approach for iteratively evolving a visual concept library using feedback from a VLM critic, to discover more effective visual concepts.
  • Figure 2: (Left) Existing work on concept-bottleneck visual recognition, where a VLM scores a set of concepts to perform classification. The predicted class is the one with the maximum concept scores. (Right) Escher. (1) Escher follows previous work (DBLP:conf/iclr/MenonV23) in instantiating a set of concepts for each class using an LLM. (2) It initializes a concept-bottleneck model and collects its predictions on a classification dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$ (labels optional). (3) A concept similarity heuristic identifies frequently confused classes. (4) A history bank then stores relevant information to guide (5) the LLM sampling procedure, which proposes improved concepts that disambiguate these classes. The new concepts are integrated into the next iteration.
  • Figure 3: A qualitative example of evolving concepts with CbD+Escher on NABirds. Initially, the model confuses two similar categories that have almost the same mean CLIP activation, indicating that the concepts provide a coarse categorization signal but miss subtle nuances. After training with Escher, the feedback mechanism identifies new characteristic features (e.g. metallic green head and neck), enabling correct classification. Additional examples are provided in § 8.6.
  • Figure 4: Qualitative analysis of Pearson's Correlation disambiguation metric for the 10 most underperforming classes for CIFAR-100, CUB, and Food101. Escher's heuristic does not require any human annotations, yet accurately approximates inter-class confusion. However, this heuristic is often over-sensitive to minute errors and is symmetric, leading to slightly suboptimal disambiguation.
  • Figure 5: Qualitative analysis of the top-k pseudo-confusion disambiguation heuristic, calculated for the 10 most underperforming classes for CIFAR-100, CUB, and Food101. After computing the top-k class scores $\text{topk}(\hat{\mathbf{y}})[:, :k]$, we compute the confusion matrix by incrementing the $(i, j)$ entry whenever classes $y_i$ and $y_j$ both occur in the top-k entries for an image. Top-k pseudo-confusion tends to be sensitive to minute errors, which leads to slightly suboptimal disambiguation.
  • ...and 2 more figures
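
The top-k pseudo-confusion heuristic described in the Figure 5 caption can be sketched in a few lines of plain Python. This is only an illustration of the caption's description, not the authors' implementation: the function names `topk_pseudo_confusion` and `most_confused_pairs` are hypothetical, the score matrix is a made-up toy input, and two details the caption does not specify are assumptions here (counts are incremented symmetrically, and the diagonal is left at zero).

```python
from itertools import combinations


def topk_pseudo_confusion(scores, k):
    """Count how often each pair of classes co-occurs in an image's top-k.

    scores: list of per-image lists, one score per class (no labels needed).
    Returns a symmetric num_classes x num_classes matrix of counts.
    Assumption: symmetric increments, diagonal left at zero.
    """
    num_classes = len(scores[0])
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for row in scores:
        # Indices of the k highest-scoring classes for this image.
        top = sorted(range(num_classes), key=lambda c: row[c], reverse=True)[:k]
        # Every pair of classes sharing the top-k counts as one pseudo-confusion.
        for i, j in combinations(top, 2):
            confusion[i][j] += 1
            confusion[j][i] += 1
    return confusion


def most_confused_pairs(confusion, n):
    """Return up to n class pairs (i < j) with the largest non-zero counts."""
    size = len(confusion)
    pairs = [(confusion[i][j], i, j) for i in range(size) for j in range(i + 1, size)]
    pairs.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j) for count, i, j in pairs[:n] if count > 0]


# Toy example: 3 images, 3 classes, k = 2.
toy_scores = [
    [0.9, 0.8, 0.1],  # classes 0 and 1 share the top-2
    [0.7, 0.6, 0.2],  # classes 0 and 1 share the top-2
    [0.1, 0.5, 0.6],  # classes 2 and 1 share the top-2
]
toy_confusion = topk_pseudo_confusion(toy_scores, k=2)
toy_pairs = most_confused_pairs(toy_confusion, n=1)
```

In this toy input, classes 0 and 1 co-occur in the top-2 for two of the three images, so they are the pair a disambiguation step would target first. Because the counts depend only on model scores, the sketch needs no ground-truth labels, which is consistent with the label-free setting the captions describe.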