VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance

Divyansh Srivastava, Ge Yan, Tsui-Wei Weng

TL;DR

VLG-CBM tackles two key problems in concept bottleneck models (CBMs): concept predictions that do not match the visual content of the input image, which undermines faithfulness, and information leakage, which undermines interpretability. The method uses open-domain grounded object detectors to produce visually grounded concept annotations and trains the CBM on the resulting auxiliary dataset, yielding more faithful concept attributions and improved predictive performance. The paper introduces the Number of Effective Concepts (NEC), a sparsity-based metric, and provides a theoretical analysis showing that a randomly initialized concept bottleneck layer (CBL) can approximate a linear classifier on the backbone representation once the number of concepts is large enough, which motivates NEC as a tool for controlling information leakage and improving interpretability. Empirical results across five benchmarks show consistent gains over prior CBMs, with interpretability supported by CBL neuron visualizations and case studies. The work offers a practical path to scalable, faithful, interpretable models for vision tasks, while noting limitations that stem from its reliance on large pretrained components and suggesting future directions such as segmentation-guided concept grounding.

Abstract

Concept Bottleneck Models (CBMs) provide interpretable predictions by introducing an intermediate Concept Bottleneck Layer (CBL), which encodes human-understandable concepts to explain the model's decisions. Recent works have proposed using Large Language Models and pre-trained Vision-Language Models to automate the training of CBMs, making it more scalable. However, existing approaches still fall short in two respects. First, the concepts predicted by the CBL often mismatch the input image, raising doubts about the faithfulness of the interpretation. Second, it has been shown that concept values encode unintended information: even a set of random concepts can achieve test accuracy comparable to state-of-the-art CBMs. To address these limitations, we propose a novel framework called Vision-Language-Guided Concept Bottleneck Model (VLG-CBM), which enables faithful interpretability together with improved performance. Our method leverages off-the-shelf open-domain grounded object detectors to provide visually grounded concept annotations, which greatly enhances the faithfulness of concept prediction while further improving model performance. In addition, we propose a new metric called Number of Effective Concepts (NEC) to control information leakage and provide better interpretability. Extensive evaluations across five standard benchmarks show that VLG-CBM outperforms existing methods by at least 4.27% and up to 51.09% on accuracy at NEC=5 (denoted ANEC-5), and by at least 0.45% and up to 29.78% on average accuracy (denoted ANEC-avg), while preserving both the faithfulness and the interpretability of the learned concepts.
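
To make the idea of a sparsity-based metric concrete, the following is a minimal sketch of how a quantity like NEC could be computed. It assumes that NEC is the average number of non-zero final-layer weights per class; the abstract does not spell out the formula, so this definition, the function name `number_of_effective_concepts`, the `tol` parameter, and the toy weights are all illustrative assumptions rather than the authors' implementation.

```python
def number_of_effective_concepts(final_layer_weights, tol=0.0):
    """Average count of non-zero concept weights per class (assumed NEC definition).

    final_layer_weights: one row per class, one column per concept.
    tol: magnitudes at or below this value are treated as zero.
    """
    if not final_layer_weights:
        raise ValueError("weight matrix must have at least one class row")
    nonzero_per_class = [
        sum(1 for w in row if abs(w) > tol) for row in final_layer_weights
    ]
    return sum(nonzero_per_class) / len(final_layer_weights)


# Hypothetical 2-class, 4-concept final layer (values made up for illustration).
toy_weights = [
    [0.8, 0.0, 0.0, -0.3],  # class 0 relies on 2 concepts
    [0.0, 1.1, 0.0, 0.0],   # class 1 relies on 1 concept
]
toy_nec = number_of_effective_concepts(toy_weights)  # (2 + 1) / 2 = 1.5
```

Under this reading, a smaller NEC means each class decision is explained by fewer concepts, which is why accuracy is reported at a fixed value such as NEC=5.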

Paper Structure (37 sections, 3 theorems, 24 equations, 19 figures, 7 tables)

This paper contains 37 sections, 3 theorems, 24 equations, 19 figures, 7 tables.

Key Result

Theorem 4.1

Suppose $\Sigma \in \mathbb{R}^{d \times d}$ is the variance matrix of the representation $z$, which is positive definite, $\lambda_{\max}$ is the largest eigenvalue of $\Sigma$, and the weight matrix $W_c \in \mathbb{R}^{k \times d}$ is sampled i.i.d. from a standard Gaussian distribution. For any lin… Here $E(k) = \mathbb{E}_{W_c} \big[\min_{(\tilde{w}, \tilde{b})} \mathbb{E}_z \big[ |f(z) - \dots$
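
The theorem statement is cut off in this excerpt, so the sketch below does not reproduce its bound. It only illustrates the setting numerically, under added assumptions: $f(z) = w^\top z + b$ is a linear classifier on $z$, the covariance of $z$ is the identity, and the bias $\tilde{b}$ is chosen optimally. Under those assumptions, the smallest expected squared gap between $f(z)$ and a linear function of the random concept logits $W_c z$ equals the squared norm of the part of $w$ lying outside the row space of $W_c$. The helper name `leakage_error`, the dimension `d = 8`, and the single random draw of `W_c` (rather than the expectation over $W_c$ that defines $E(k)$) are choices made for illustration.

```python
import random


def leakage_error(w, concept_rows):
    """Squared norm of the component of w outside span(concept_rows).

    Assuming identity covariance for z and an optimally chosen bias, this is
    the smallest expected squared gap between w.z + b and any linear function
    of the concept logits W_c z.
    """
    residual = list(w)
    basis = []  # orthonormal basis of the rows seen so far (Gram-Schmidt)
    for row in concept_rows:
        v = list(row)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm < 1e-10:
            continue  # this row adds no new direction
        q = [a / norm for a in v]
        basis.append(q)
        # Remove from w the direction that the new concept can now express.
        dot = sum(a * b for a, b in zip(residual, q))
        residual = [a - dot * b for a, b in zip(residual, q)]
    return sum(a * a for a in residual)


rng = random.Random(0)
d = 8  # representation dimension (illustrative)
w = [rng.gauss(0.0, 1.0) for _ in range(d)]
W_c = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(2 * d)]

# Error when only the first k random concepts are available, k = 0 .. 2d.
errors = [leakage_error(w, W_c[:k]) for k in range(2 * d + 1)]
```

In this sketch the error shrinks as more random concepts are added and is numerically zero once $k \ge d$, because $d$ Gaussian rows span $\mathbb{R}^d$ almost surely. This is the leakage phenomenon in miniature: with enough concepts, random concept weights carry everything a linear classifier on $z$ needs.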

Figures (19)

  • Figure 1: We compare the decision explanations of VLG-CBM with those of existing methods by listing the top-5 concept contributions to each decision. Our observations include: (1) VLG-CBM provides concise and accurate concept attributions for its decisions; (2) LF-CBM [oikarinen2023label] frequently uses negative concepts for explanation, which is less informative; (3) LM4CV [yan2023learning] attributes decisions to concepts that do not match the images, in part because it uses a limited number of concepts, which hurts the CBM's ability to explain diverse images; (4) both LF-CBM and LM4CV draw a significant portion of their contribution from non-top concepts, making their decisions less transparent. The full figure is in Appendix Fig. D.1.
  • Figure 2: VLG-CBM pipeline: we design an automated vision-and-language-guided approach to train Concept Bottleneck Models.
  • Figure 3: Accuracy comparison between our VLG-CBM, LF-CBM [oikarinen2023label], and a randomly initialized concept bottleneck layer under different NEC values on the CIFAR10 dataset. The results show that (1) when NEC is large enough, even a random CBL can achieve near-optimal accuracy, supporting the existence of information leakage; (2) as NEC decreases, the accuracy of LF-CBM and of random weights begins to drop, while VLG-CBM shows no significant decrease.
  • Figure 4: Top-5 activated images of example concept neurons in VLG-CBM on the CUB dataset.
  • Figure D.1: Full version of Fig. 1, comparing the explanations of LF-CBM and LM4CV with those of VLG-CBM (ours).
  • ...and 14 more figures

Theorems & Definitions (6)

  • Theorem 4.1
  • Remark 4.1
  • Theorem A.1
  • proof
  • Corollary A.1
  • Remark A.2