Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

Abstract

Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question because visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted through their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies for defining and evaluating the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: first, certain concept prompts include shortcuts that recognize correct concepts for the wrong reasons; second, multimodal information (e.g., visual discriminativeness and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g., "spiky" as opposed to "spiky durian"), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. All code and models are publicly released.

Paper Structure (28 sections, 2 equations, 9 figures, 13 tables)

Figures (9)

  • Figure 1: The top-selected text prompts by CLIP given a query image. VDES is proposed by menon2022visual, whereas CDL is our proposed approach. The design of concept prompts plays a critical role in understanding whether VLMs learn visual concepts. We observe that a concept-augmented prompt can predict correct visual concepts (e.g., gray back and wings) when the prompt is associated with the category name (California seagull). When the category name is removed from the prompt (Column 2), the retrieved concepts are either non-visual or incorrect. We attribute this to the category name bias (Column 3), as the correct category can be retrieved by CLIP even when the paired descriptions are randomly shuffled and thus irrelevant. We propose a concept discovery and learning (CDL) framework and demonstrate that pre-trained VLMs can indeed learn visual concepts (e.g., Column 4). Correctly predicted concepts are in green, wrong concepts are in red, and non-visual concepts are in violet. Category names are in orange.
  • Figure 2: Illustration of our proposed concept discovery method. Given image-caption pairs, we first identify objects from the captions and utilize a large language model to propose candidate concepts for the objects. The concepts are then ranked by the agreement between VLM knowledge (concept recognition from the image) and LLM knowledge (concept proposed based on the caption) based on mutual information.
  • Figure 3: Illustration of the concept-based object recognition framework and our proposed concept learning method. We map the concept activations $\mathbf{a}$ to categories with the concept-category association matrix $\mathcal{W}$. For object recognition, only $\mathcal{W}$ is optimized based on object classification supervision. For concept learning, we assume $\mathcal{W}$ is a binary matrix given by LLM knowledge, and learn to update $\mathbf{a}$ by fine-tuning the last layers of the visual and text encoders in the VLM. We rely on the VLM-recognized object labels as opposed to ground-truth object labels, hence the process is self-supervised.
  • Figure 4: Few-shot classification evaluation with LaBo and our method.
  • Figure 5: Human evaluation of the discovered concepts for their interpretability, precision, and thoroughness. We evaluate the concepts selected without and with concept learning (denoted as "CD" and "CDL").
  • ...and 4 more figures
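
The captions of Figures 2 and 3 describe two computations that a small example may make more concrete: ranking concepts by the agreement between VLM and LLM knowledge via mutual information, and mapping concept activations $\mathbf{a}$ to categories through $\mathcal{W}$. The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each concept comes with two binary indicator sequences over the same image-caption pairs (VLM recognized the concept, LLM proposed the concept), and that category scores are a plain linear map of the activations. The function names `mutual_information`, `rank_concepts`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (px[x] * py[y])
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi


def rank_concepts(vlm_hits, llm_hits):
    """Sort concept names by VLM/LLM agreement, highest mutual information first.

    Assumption: vlm_hits[c][i] is 1 if the VLM recognizes concept c in image i,
    and llm_hits[c][i] is 1 if the LLM proposes concept c for caption i.
    """
    scores = {c: mutual_information(vlm_hits[c], llm_hits[c]) for c in vlm_hits}
    return sorted(scores, key=scores.get, reverse=True)


def classify(a, W):
    """Map concept activations a (length K) to a category index using W (K x C).

    Assumption: category scores are the plain linear map a @ W.
    """
    num_categories = len(W[0])
    scores = [sum(a[k] * W[k][c] for k in range(len(a))) for c in range(num_categories)]
    return max(range(num_categories), key=scores.__getitem__)


# Toy data: "spiky" agrees perfectly across the two sources, "tasty" is independent.
vlm_hits = {"spiky": [1, 0, 1, 0], "tasty": [1, 1, 0, 0]}
llm_hits = {"spiky": [1, 0, 1, 0], "tasty": [1, 0, 1, 0]}
ranking = rank_concepts(vlm_hits, llm_hits)

# Toy binary association matrix: concept 0 belongs to category 0, concept 1 to category 1.
predicted = classify([0.9, 0.1], [[1, 0], [0, 1]])
```

In this toy data the perfectly agreeing concept ranks above the independent one, and the activation vector dominated by the first concept selects the first category. The selection criterion and training objective used in the paper may differ in detail from this sketch.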