DEAL: Disentangle and Localize Concept-level Explanations for VLMs

Tang Li, Mengmeng Ma, Xi Peng

TL;DR

Large pre-trained Vision-Language Models (VLMs) often fail to separate fine-grained concepts, which yields entangled and mislocalized explanations. The authors introduce DEAL, a plug-in, self-supervised framework that prompts Large Language Models to generate discriminative concepts, computes post-hoc heatmap explanations, and enforces both disentanglement among concept explanations and localization consistency with category explanations via a constrained objective Risk(f) = E_{(I,T)}[ L_contr(f(I,T)) ] + λ R_disen + γ R_local. By optimizing this objective with Lagrange multipliers, DEAL improves concept-level disentanglability and localizability without altering model architectures, while also improving prediction accuracy across diverse datasets and backbones. Ablations and evaluations against ground-truth part annotations indicate that both constraints are necessary, and additional results cover per-image and per-concept explainability as well as retrieval. The approach reduces reliance on spurious correlations and provides human-understandable concept-level explanations, which is useful for safety-critical and generalization-sensitive applications.
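
To make the role of the Lagrange multipliers explicit, the objective can be read as the relaxation of a constrained problem. The following is a sketch of the standard Lagrangian argument, not necessarily the paper's exact formulation; the thresholds $\epsilon_1$ and $\epsilon_2$ are assumed symbols introduced here for illustration only.

$$
\min_f \; \mathbb{E}_{(I,T)}\big[\mathcal{L}_\mathrm{contr}(f(I,T))\big] \quad \text{s.t.} \quad \mathcal{R}_\mathrm{disen} \le \epsilon_1, \;\; \mathcal{R}_\mathrm{local} \le \epsilon_2
$$

$$
\mathrm{Risk}(f) = \mathbb{E}_{(I,T)}\big[\mathcal{L}_\mathrm{contr}(f(I,T))\big] + \lambda\,\mathcal{R}_\mathrm{disen} + \gamma\,\mathcal{R}_\mathrm{local}, \qquad \lambda, \gamma \ge 0
$$

The second line is the Lagrangian of the first with the constant terms $-\lambda\epsilon_1 - \gamma\epsilon_2$ dropped, since they do not depend on $f$. Larger $\lambda$ or $\gamma$ corresponds to a tighter tolerance on the respective constraint.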

Abstract

Large pre-trained Vision-Language Models (VLMs) have become ubiquitous foundational components of other models and downstream tasks. Although powerful, our empirical results reveal that such models might not be able to identify fine-grained concepts. Specifically, the explanations of VLMs with respect to fine-grained concepts are entangled and mislocalized. To address this issue, we propose to DisEntAngle and Localize (DEAL) the concept-level explanations for VLMs without human annotations. The key idea is encouraging the concept-level explanations to be distinct while maintaining consistency with category-level explanations. We conduct extensive experiments and ablation studies on a wide range of benchmark datasets and vision-language models. Our empirical results demonstrate that the proposed method significantly improves the concept-level explanations of the model in terms of disentanglability and localizability. Surprisingly, the improved explainability alleviates the model's reliance on spurious correlations, which further benefits the prediction accuracy.
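
The key idea (distinct concept-level explanations that stay consistent with the category-level explanation) can be illustrated with a minimal NumPy sketch. The exact definitions of the two regularizers are not given in this summary, so everything below is an assumption: disentanglement is approximated by mean pairwise cosine similarity between concept heatmaps, localization by one minus the cosine similarity between the category heatmap and the sum of the concept heatmaps, and the function names are hypothetical.

```python
import numpy as np

def _unit(v, eps=1e-8):
    """Flatten a heatmap and scale it to (approximately) unit L2 norm."""
    v = np.asarray(v, dtype=float).ravel()
    return v / (np.linalg.norm(v) + eps)

def disentanglement_penalty(concept_maps):
    """Assumed stand-in for R_disen: mean pairwise cosine similarity
    between concept heatmaps. Lower means more distinct explanations."""
    k = len(concept_maps)
    if k < 2:
        return 0.0
    units = np.stack([_unit(m) for m in concept_maps])
    sim = units @ units.T                      # (k, k) cosine similarities
    off_diag = sim[~np.eye(k, dtype=bool)]     # drop self-similarities
    return float(off_diag.mean())

def localization_penalty(category_map, concept_maps):
    """Assumed stand-in for R_local: one minus the cosine similarity between
    the category heatmap and the summed concept heatmaps."""
    stacked = np.stack([np.asarray(m, dtype=float) for m in concept_maps])
    aggregate = stacked.sum(axis=0)            # aggregation by sum is an assumption
    return float(1.0 - _unit(category_map) @ _unit(aggregate))
```

With two concept heatmaps that highlight disjoint regions, the first penalty is close to zero; with two identical heatmaps it is close to one. The second penalty is close to zero when the concept heatmaps together cover the same region as the category heatmap.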

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: We visualize the CLIP (Radford et al., 2021) model's explanations for fine-grained concepts. The comparison is conducted using the explanation method of Chefer et al. (2021) on the ViT-B/32 vision backbone. (Left) CLIP's explanations w.r.t. visually distinct concepts might highlight the same region, or even mislocalize concepts to spurious factors, e.g., background. In contrast, the concept-level explanations of our model are well-disentangled and correctly localized. (Right) Our method significantly improves the model's concept-level explanations in terms of disentanglability and localizability compared to state-of-the-art VLMs on different benchmark datasets. This figure is best viewed in color.
  • Figure 1: Disentanglability comparison using images from the ImageNet (Deng et al., 2009) dataset. This figure is best viewed in color.
  • Figure 2: Overview of the proposed method to Disentangle and Localize (DEAL) concept-level explanations for VLMs. Our method consists of the concept-level explanation disentanglement and localization constraints for training Vision-Language Models (VLMs). First, we query the Large Language Model (LLM), e.g., GPT-3.5 (Peng et al., 2023), with the category name to obtain discriminative visual concepts for distinguishing the category. Then, we feed the category and concepts as a sentence into the text encoder, and the image into the image encoder to calculate the standard contrastive learning loss ($\mathcal{L}_\mathrm{contr}$). Next, we calculate the explanations w.r.t. the category name and each of the concepts within the category. We constrain the disentanglement between the explanations of all concepts within the category ($\mathcal{R}_\mathrm{disen}$). Finally, we constrain localization by the consistency between the category-level explanation and the aggregation of concept-level explanations ($\mathcal{R}_\mathrm{local}$). This figure is best viewed in color.
  • Figure 2: Localizability comparison per concept using images from the ImageNet (Deng et al., 2009) dataset. This figure is best viewed in color.
  • Figure 3: Directly querying the GPT-3.5 (Peng et al., 2023) model yields concepts that are not visually measurable. In contrast, our concepts are visually distinguishable. The bold words are the category names in ImageNet (Deng et al., 2009), and the lists that follow are the generated concepts.
  • ...and 3 more figures
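
The first two steps of the Figure 2 overview (building a sentence from the category and its concepts, then computing the contrastive loss) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the prompt template in `build_prompt`, the temperature value, and both function names are hypothetical, and plain arrays stand in for the outputs of the image and text encoders.

```python
import numpy as np

def build_prompt(category, concepts):
    """Join a category name and its concepts into one sentence.
    The template is an assumption made for illustration."""
    return f"a photo of a {category}, which has " + ", ".join(concepts)

def _log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch in which
    row i of image_emb is paired with row i of text_emb.
    The temperature value is an assumption."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature         # (batch, batch) similarities
    idx = np.arange(len(logits))
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)
```

In a real training loop the embeddings would come from the VLM's encoders applied to the image and to the sentence returned by `build_prompt`; the loss is small when each image is most similar to its own sentence and large when pairs are mismatched.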