Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance

Hui Liu, Wenya Wang, Kecheng Chen, Jie Liu, Yibing Liu, Tiexin Qin, Peisong He, Xinghao Jiang, Haoliang Li

TL;DR

This paper tackles zero-shot image recognition by enabling vision-language models to reason with human-like concepts. It introduces CHBR, a Concept-guided Human-like Bayesian Reasoning framework that marginalizes over a latent concept space using an LLM-driven importance sampler and discriminative tests to assign priors, paired with three test-time likelihoods (Average, Confidence, and TTA) to adapt to individual images. Across 15 datasets, CHBR consistently outperforms strong zero-shot baselines, with notable gains in fine-grained tasks and robustness to distribution shifts, while offering flexible and training-free inference for two of the likelihood variants. The work advances practical zero-shot generalization by combining concept discovery, prior elicitation from LLMs, and adaptive likelihoods, paving the way for plug-and-play concept enrichment in Vision-Language Models.

Abstract

In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories by composing known simpler concepts. However, existing vision-language models (VLMs), despite achieving significant progress through large-scale natural language supervision, often underperform in real-world applications because of sub-optimal prompt engineering and an inability to adapt effectively to target classes. To address these issues, we propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes' theorem, CHBR models the concepts used in human image recognition as latent variables and formulates the task as a sum over potential concepts, weighted by a prior distribution and a likelihood function. To tackle the intractable computation over an infinite concept space, we introduce an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts, emphasizing inter-class differences. We further propose three heuristic likelihoods, Average Likelihood, Confidence Likelihood, and Test Time Augmentation (TTA) Likelihood, which dynamically refine the combination of concepts based on the test image. Extensive evaluations across fifteen datasets demonstrate that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.
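
To make the "sum over concepts, weighted by a prior and a likelihood" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the `scores` layout, and in particular the "confidence" weighting (each concept's maximum class probability) are assumptions chosen for illustration, since the exact likelihood definitions are not given in this summary. The sketch assumes a finite set of already-sampled concepts with known prior weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginalize_over_concepts(scores, prior, likelihood="average"):
    """Combine per-concept class predictions into one class distribution.

    scores[j][k]: hypothetical image-text similarity logit for class k when
                  the class prompt is enriched with concept j.
    prior[j]:     prior weight of concept j (assumed given, e.g. elicited
                  from an LLM).
    likelihood:   "average" weights concepts by the prior alone;
                  "confidence" additionally scales each concept by its
                  maximum class probability (an illustrative stand-in,
                  not the paper's formula).
    """
    # p(y | x, c_j) for every sampled concept c_j
    per_concept = [softmax(row) for row in scores]

    if likelihood == "average":
        lik = [1.0] * len(per_concept)
    elif likelihood == "confidence":
        lik = [max(p) for p in per_concept]
    else:
        raise ValueError(f"unknown likelihood: {likelihood}")

    # Normalized concept weights: prior times likelihood
    weights = [pr * l for pr, l in zip(prior, lik)]
    z = sum(weights)
    weights = [w / z for w in weights]

    # p(y | x) = sum_j weight_j * p(y | x, c_j)
    n_classes = len(scores[0])
    return [
        sum(w * p[k] for w, p in zip(weights, per_concept))
        for k in range(n_classes)
    ]


# Two concepts, two classes: the first concept separates the classes,
# the second is uninformative for this image.
example_scores = [[2.0, 0.0], [0.0, 0.0]]
example_prior = [0.5, 0.5]
avg = marginalize_over_concepts(example_scores, example_prior, "average")
conf = marginalize_over_concepts(example_scores, example_prior, "confidence")
```

In this toy case the confidence-style weighting shifts mass toward the concept that discriminates between the classes, which is the kind of per-image adaptation the abstract describes; a TTA-style variant would instead derive the weights from predictions on augmented views of the test image.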

Paper Structure

This paper contains 19 sections, 9 equations, 7 figures, 11 tables, 2 algorithms.

Figures (7)

  • Figure 1: The pipeline of the concept-based human-like reasoning process for zero-shot image recognition. When the image depicts a shark on the ocean floor, the weight assigned to the concept scene is reduced, as humans can recognize that sharks are not exclusively on the sea but can also be found in deeper ocean environments.
  • Figure 2: Bayesian network that conceptualizes human-like zero-shot image recognition process.
  • Figure 3: The difference in average top-1 accuracy between CHBR, instantiated with various visual encoders, and the baseline CLIP across ten fine-grained image recognition tasks. Hyperparameters are kept consistent across all experiments.
  • Figure 4: Top-1 accuracy difference between discriminative and descriptive concepts on ten fine-grained datasets. We use Average Likelihood for illustration.
  • Figure 5: Visualization of the contribution of visual and text tokens to the final prediction for the target class prompt, distractor class prompt, and concept-based target class prompt. The distractor classes are the classes into which the original image of the target class is misclassified. In the visual heatmap, higher red intensity indicates a stronger influence on the prediction. For text tokens, darker shades of green correspond to higher contributions. Numerical values following the prompts are predicted probabilities.
  • ...and 2 more figures