From Segments to Concepts: Interpretable Image Classification via Concept-Guided Segmentation

Ran Eisenberg, Amit Rozner, Ethan Fetaya, Ofir Lindenbaum

TL;DR

SEG-MIL-CBM addresses interpretability in vision by grounding predictions in semantically meaningful image regions. It combines a CLIP-guided concept segmentation pipeline with an attention-based multiple instance learning (MIL) model that treats each segment as an instance and aligns segment-level concept activations with CLIP-derived similarity scores, producing spatially grounded, concept-level explanations without concept annotations. Empirically, it improves worst-group accuracy under spurious correlations, maintains strong performance on standard benchmarks, and is robust to common corruptions, while offering interpretable region-level reasoning. The approach links interpretability and robustness for open-world vision systems, with potential applicability to safety-critical tasks where regional evidence and concept alignment are crucial.

Abstract

Deep neural networks have achieved remarkable success in computer vision; however, their black-box nature in decision-making limits interpretability and trust, particularly in safety-critical applications. Interpretability is crucial in domains where errors have severe consequences. Existing models not only lack transparency but also risk exploiting unreliable or misleading features, which undermines both robustness and the validity of their explanations. Concept Bottleneck Models (CBMs) aim to improve transparency by reasoning through human-interpretable concepts. Still, they require costly concept annotations and lack spatial grounding, often failing to identify which regions support each concept. We propose SEG-MIL-CBM, a novel framework that integrates concept-guided image segmentation into an attention-based multiple instance learning (MIL) framework, where each segmented region is treated as an instance and the model learns to aggregate evidence across them. By reasoning over semantically meaningful regions aligned with high-level concepts, our model highlights task-relevant evidence, down-weights irrelevant cues, and produces spatially grounded, concept-level explanations without requiring annotations of concepts or groups. SEG-MIL-CBM achieves robust performance across settings involving spurious correlations (unintended dependencies between background and label), input corruptions (perturbations that degrade visual quality), and large-scale benchmarks, while providing transparent, concept-level explanations.
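The aggregation idea in the abstract (each segment is an instance, and evidence is combined across segments with attention) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy values, and the use of a softmax to normalize attention scores are assumptions made for illustration, and in the actual model the attention scores would be produced by a learned module rather than supplied by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(segment_concepts, attention_scores):
    """Combine per-segment concept activations into one image-level vector.

    segment_concepts: list of N_s lists, one concept-activation vector per segment.
    attention_scores: list of N_s raw scores (hypothetical; learned in practice).
    Returns the attention weights and the weighted sum of concept vectors.
    """
    alphas = softmax(attention_scores)
    n_concepts = len(segment_concepts[0])
    c_agg = [
        sum(a * z[k] for a, z in zip(alphas, segment_concepts))
        for k in range(n_concepts)
    ]
    return alphas, c_agg

# Toy example (made-up numbers): three segments, two concepts.
# The third segment gets a very low score, so it is almost ignored.
segment_concepts = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]]
attention_scores = [2.0, 2.0, -5.0]
alphas, c_agg = attention_pool(segment_concepts, attention_scores)
```

Because the weights sum to one, a segment with a low attention score contributes little to the aggregated concept vector, which is the sense in which irrelevant cues are down-weighted.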

Paper Structure

This paper contains 20 sections, 3 equations, 6 figures, 9 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of Concept Bottleneck Models (CBM) versus our proposed SEG-MIL-CBM. (a) CBMs predict labels using concept bottleneck layers, which are derived at the global image level. (b) SEG-MIL-CBM first segments the image into semantically meaningful regions and treats each as an instance in an attention-based multiple instance learning framework. This enables the model to identify task-relevant regions, down-weight irrelevant cues, and provide concept-level explanations that are both interpretable and spatially grounded.
  • Figure 2: Overview of our concept-guided segmentation pipeline. Given an input image, the CLIP [radford2021learning] image encoder extracts an image embedding, while the CLIP text encoder encodes a concept set. The top-$K_{\text{top}}$ concepts most relevant to the image are selected by cosine-similarity score. They are then used with GroundingDINO [liu2024grounding] and SAM [kirillov2023segment] to produce semantically meaningful segments. Each segment is annotated with concepts (e.g., “yellowish breast”, “black throat”) and their corresponding scores $\boldsymbol{z}_i^{\mathrm{CLIP}}$.
  • Figure 3: Overview of the SEG-MIL-CBM training pipeline. Each input image is decomposed into concept-guided segments $\{\boldsymbol{s}_1, \dots, \boldsymbol{s}_{N_s}\}$, which are passed through a shared backbone to produce features $\boldsymbol{h}_i = \phi(\boldsymbol{s}_i)$. These features are projected into a concept space via $\boldsymbol{Z} = \boldsymbol{W}_c \boldsymbol{H}$, and segment-level activations are aligned with CLIP-derived similarity vectors using a similarity-based concept loss. An attention mechanism assigns a weight $\alpha_i$ to each segment, allowing the model to aggregate concept activations into a weighted representation $\boldsymbol{c}_{\text{agg}}$, which is then fed to the classifier head. The total training objective combines an image-level classification loss with the similarity-based concept loss, encouraging both predictive performance and semantic interpretability.
  • Figure 4: Accuracy with 95% confidence intervals across five CIFAR-10-C corruptions (frost, Gaussian blur, Gaussian noise, shot noise, zoom blur) for vanilla ResNet-50 (pretrained on CIFAR-10), SEG-MIL-CBM, Label-Free CBM [oikarinenlabel], and Post-hoc CBM [yuksekgonul2023posthoc].
  • Figure 5: Accuracy trends across corruption types (page 1)
  • ...and 1 more figure
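
The concept-selection step described in the Figure 2 caption (rank concepts by cosine similarity to the image embedding and keep the top $K_{\text{top}}$) can be sketched as follows. This is an illustrative sketch only: the embeddings are tiny hand-written vectors standing in for CLIP image and text embeddings, the "metal fence" concept is an invented distractor, and the helper names `cosine` and `top_k_concepts` are hypothetical rather than part of any library or of the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_concepts(image_emb, concept_embs, k):
    """Return the k (concept, score) pairs most similar to the image embedding."""
    scored = [(name, cosine(image_emb, emb)) for name, emb in concept_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Stand-in embeddings (assumption: real ones would come from CLIP encoders).
image_emb = [1.0, 0.0, 1.0]
concept_embs = {
    "yellowish breast": [1.0, 0.0, 0.9],
    "black throat": [0.8, 0.2, 1.0],
    "metal fence": [0.0, 1.0, 0.0],  # hypothetical irrelevant concept
}
selected = top_k_concepts(image_emb, concept_embs, k=2)
selected_names = [name for name, _ in selected]
```

In the pipeline the caption describes, the selected concept names would then serve as text prompts for the detection and segmentation models; that step is omitted here because it depends on those models' own interfaces.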