PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts

Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang

TL;DR

Inspired by the human visual perception process, the authors observe that providing CLIP with contextual attributes improves zero-shot image classification and mitigates reliance on spurious features. They propose PerceptionCLIP, a training-free, two-step zero-shot classification method.

Abstract

Vision-language models like CLIP are widely used in zero-shot image classification due to their ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better performance is still an open question. This paper draws inspiration from the human visual perception process: when classifying an object, humans first infer contextual attributes (e.g., background and orientation), which help separate the foreground object from the background, and then classify the object based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot image classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method, PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioned on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. Our code is available at https://github.com/umd-huang-lab/perceptionCLIP
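
The two-step procedure described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on precomputed, L2-normalized embeddings rather than calling a real CLIP model, the function name `two_step_classify` and the `temperature` value are hypothetical, and weighting the conditional predictions by the inferred attribute distribution is one plausible reading of "conditioning" (picking only the single most likely attribute would be another).

```python
import numpy as np

def softmax(x, axis=None):
    """Numerically stable softmax over the given axis (all entries if None)."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def two_step_classify(image_emb, text_embs, temperature=0.01):
    """Hypothetical sketch of infer-then-condition classification.

    image_emb: (d,) normalized image embedding.
    text_embs: (num_classes, num_attrs, d) normalized embeddings of prompts
               that combine a class name with an attribute description,
               e.g. "a photo of a dog, on the grass" (assumed prompt form).
    """
    # Similarity of the image to every (class, attribute) prompt.
    scores = (text_embs @ image_emb) / temperature      # (C, Z)

    # Step 1: infer the contextual attribute by summing the joint
    # distribution over classes, giving p(attribute | image).
    p_joint = softmax(scores)                           # over all (C, Z) pairs
    p_attr = p_joint.sum(axis=0)                        # (Z,)

    # Step 2: classify conditioned on each attribute, then weight the
    # conditional predictions by the inferred attribute probabilities.
    p_class_given_attr = softmax(scores, axis=0)        # (C, Z)
    p_class = p_class_given_attr @ p_attr               # (C,)
    return p_attr, p_class

# Toy usage: 3 classes, 2 attribute values, orthogonal stand-in embeddings.
toy_text_embs = np.eye(6).reshape(3, 2, 6)
toy_image_emb = toy_text_embs[1, 1]   # image matches class 1, attribute 1
toy_p_attr, toy_p_class = two_step_classify(toy_image_emb, toy_text_embs)
print(np.argmax(toy_p_attr), np.argmax(toy_p_class))
```

In practice the embeddings would come from CLIP's image and text encoders; the toy identity-matrix embeddings are used here only so the sketch runs without model weights.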

Paper Structure

This paper contains 33 sections, 10 equations, 7 figures, 19 tables, and 1 algorithm.

Figures (7)

  • Figure 1: (Left): CLIP correlates natural language descriptions of contextual attributes with visual cues (orientation: upside-down). (Center): Unlike CLIP's standard zero-shot inference, which uses fixed template(s) for class name retrieval, our method first infers contextual attributes (background: on the grass) using CLIP and then lets CLIP predict the class conditioned on the inferred contextual attributes. Here, background and orientation are both examples of contextual attributes. (Right): Grad-CAM visualization illustrates that our method focuses more on core features (on the dog) and is less distracted by spurious features (grass background) when performing the object classification.
  • Figure 2: Illustration of contextual attributes, their symbolic discrete values, and the possible textual descriptions mapped by the annotation function.
  • Figure 3: Evaluating CLIP scores on ImageNet with different transformations altering the contextual attributes. The attribute-aware $\mathtt{CLIP}$ score gives higher scores for correctly matched image-attribute pairs (green) while giving lower scores for mismatched pairs (grey) and random pairs (blue), confirming CLIP's understanding of our contextual attribute descriptions. $\mathtt{CLIP}$ score measures the similarity between images and contextual attributes, while the original CLIP score (orange) is attribute-agnostic.
  • Figure 4: Images of a leopard and a waterbird, core and spurious features, and Grad-CAM heatmaps using no, incorrect, and ground-truth contextual attributes (with text below images). The bar shows the core vs. spurious ratio in the heatmap. The visualization shows that classification conditioned on correct contextual attributes strengthens CLIP's focus on core features.
  • Figure 5: The increase in (left) $\mathtt{CLIP}$ scores and (right) prediction probabilities from incorporating descriptions of the correct contextual attribute into the text prompts. We compare the increases in $\mathtt{CLIP}$ scores and prediction probabilities for the ground-truth class $y^*$ and for the Top-5 and Top-10 wrong classes. (left) Incorporating ground-truth attributes into text prompts increases $\mathtt{CLIP}$ scores for both correct and incorrect classes. This improvement is attributed to the better alignment of the text prompts with the images, since the prompts now describe previously overlooked contextual attributes. Notably, the $\mathtt{CLIP}$ score of the correct class benefits more from this enhancement for all the attributes considered. This is because the accurate description of the class, combined with the contextual attributes, achieves a more precise alignment with the corresponding image. (right) Therefore, the model is more likely to predict the correct class after being provided with the correct context description in the prompt.
  • ...and 2 more figures
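
The reasoning in the Figure 5 caption (every class's score rises, but the correct class rises most, so its predicted probability rises) can be checked with a small numeric sketch. The scores and gains below are hypothetical values chosen for illustration, not numbers from the paper, and a plain softmax over scaled scores is assumed as the prediction rule.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

# Hypothetical scaled scores (NOT from the paper); index 0 is the true class.
scores_without_attr = np.array([2.0, 1.8, 1.5])

# Adding the correct attribute description raises every score,
# but the true class gains the most (assumed gains).
gain = np.array([0.6, 0.3, 0.3])
scores_with_attr = scores_without_attr + gain

p_before = softmax(scores_without_attr)[0]
p_after = softmax(scores_with_attr)[0]
print(f"p(true class) before: {p_before:.3f}, after: {p_after:.3f}")
```

Because softmax is unchanged by adding the same constant to every score, only the extra gain of the true class relative to the others affects its probability, which is why a uniform rise alone would not help.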