Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

TL;DR

This work addresses spurious correlations in CLIP-like vision–language models by introducing Cluster-based Concept Importance (CCI), a training-free interpretability method that clusters patch embeddings into semantically meaningful concepts and masks them to assess their effect on image–text similarity. It also presents COVAR, a large, controlled benchmark that systematically varies background, viewpoint, scale, and other factors to disentangle background reliance from other error sources. CCI achieves state-of-the-art faithfulness on deletion/insertion metrics and, when combined with GroundedSAM, reveals limitations of accuracy-based benchmarks like CounterAnimals. Across 18 CLIP variants, COVAR uncovers substantial, factor-dependent robustness gaps, most notably in scale. These findings give concrete guidance for data curation, architectural improvements, and fine-grained supervision to make vision–language models more robust.

Abstract

Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP's own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we present a comprehensive evaluation of eighteen CLIP variants, offering methodological advances and empirical evidence that chart a path toward more robust VLMs.
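The cluster-mask-compare idea described above can be sketched on toy data. The following is a minimal illustration and not the authors' implementation: it assumes the image embedding is simply the mean of the patch embeddings (a stand-in for CLIP's image encoder), uses a hand-written k-means, and the names `kmeans`, `cosine`, and `cluster_importance` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def kmeans(x, k, iters=20):
    """Plain Lloyd's k-means; returns one integer cluster label per row of x."""
    # Farthest-point initialisation keeps the sketch deterministic.
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cluster_importance(patches, text_emb, k=3):
    """Score each patch cluster by the relative similarity drop when it is masked.

    Assumption: the "model" is mean pooling over the unmasked patch embeddings.
    """
    labels = kmeans(patches, k)
    base = cosine(patches.mean(axis=0), text_emb)
    scores = {}
    for j in np.unique(labels):
        keep = labels != j
        # If masking removes every patch, treat the similarity as zero.
        masked = cosine(patches[keep].mean(axis=0), text_emb) if keep.any() else 0.0
        scores[int(j)] = (base - masked) / (abs(base) + 1e-12)
    return labels, scores

# Toy data: 8 "object" patches aligned with the text embedding, 8 "background" patches.
rng = np.random.default_rng(0)
text_emb = np.eye(8)[0]
obj = np.eye(8)[0] + 0.05 * rng.normal(size=(8, 8))
bg = np.eye(8)[1] + 0.05 * rng.normal(size=(8, 8))
patches = np.vstack([obj, bg])

labels, scores = cluster_importance(patches, text_emb, k=2)
print(scores)  # the cluster containing the object patches should score highest
```

In this toy setting a positive score means that masking the cluster lowers image–text similarity, so the prediction depends on that cluster. A negative score means the cluster was diluting the match.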

Paper Structure

This paper contains 26 sections, 2 equations, 21 figures, 7 tables.

Figures (21)

  • Figure 1: (a) Image and CCI maps for CLIP’s prediction (red–blue heatmap, red = stronger attention). (b) Samples from the easy and hard sets of CounterAnimals (CA). (c) Example of dataset curation in COVAR. (d) Proportion of different sources of errors in the CA, ImageNet, and COVAR subsets.
  • Figure 2: Patch clusters.
  • Figure 3: Qualitative comparison of CCI against baseline interpretability methods.
  • Figure 4: Deletion and insertion curves demonstrating CCI's quantitative superiority in identifying decision-relevant regions.
  • Figure 5: CCI analysis of CLIP failures on (a) ImageNet and (b) CounterAnimals. Rows show the image, the ground-truth foreground (FG) mask obtained using GroundedSAM, and attribution maps from MaskCLIP, Grad-ECLIP, and CCI, with ground-truth (GT) and predicted (Pred) labels below.
  • ...and 16 more figures
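
The deletion metric named in the abstract and plotted in Figure 4 can be illustrated with a small sketch. This follows the commonly used deletion protocol (remove regions in decreasing order of attributed importance, record the model score, and integrate the curve) and is an assumption about the general recipe, not the paper's exact evaluation code. The scoring function `toy_score` and the helpers `deletion_curve` and `auc` are hypothetical, and the "model" is again mean pooling over the remaining patch embeddings.

```python
import numpy as np

def toy_score(patches, keep, text_emb):
    """Cosine similarity between the mean of the kept patches and the text embedding."""
    if not keep.any():
        return 0.0
    v = patches[keep].mean(axis=0)
    return float(v @ text_emb / (np.linalg.norm(v) * np.linalg.norm(text_emb) + 1e-12))

def deletion_curve(patches, text_emb, importance, score_fn=toy_score):
    """Remove patches from most to least important and record the score after each step."""
    order = np.argsort(-importance)
    keep = np.ones(len(patches), dtype=bool)
    curve = [score_fn(patches, keep, text_emb)]
    for idx in order:
        keep[idx] = False
        curve.append(score_fn(patches, keep, text_emb))
    return np.array(curve)

def auc(curve):
    """Trapezoidal area under the curve, with the x-axis normalised to [0, 1]."""
    x = np.linspace(0.0, 1.0, len(curve))
    return float(np.sum((curve[1:] + curve[:-1]) / 2.0 * np.diff(x)))

# Toy data: 8 "object" patches aligned with the text embedding, 8 "background" patches.
rng = np.random.default_rng(0)
text_emb = np.eye(8)[0]
obj = np.eye(8)[0] + 0.05 * rng.normal(size=(8, 8))
bg = np.eye(8)[1] + 0.05 * rng.normal(size=(8, 8))
patches = np.vstack([obj, bg])

faithful = np.array([1.0] * 8 + [0.0] * 8)  # attribution that highlights the object
inverted = 1.0 - faithful                   # attribution that highlights the background

curve = deletion_curve(patches, text_emb, faithful)
faithful_auc = auc(curve)
inverted_auc = auc(deletion_curve(patches, text_emb, inverted))
print(faithful_auc, inverted_auc)  # lower deletion AUC indicates a more faithful attribution
```

The insertion variant mirrors this procedure: start from a fully masked input, add regions in decreasing order of importance, and treat a higher area under the curve as better.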