Bridging Explainability and Embeddings: BEE Aware of Spuriousness

Cristian Daniel Păduraru, Antonio Bărbălau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu

TL;DR

This work tackles spurious correlations that are learned during fine-tuning but escape detection by standard evaluation. It introduces BEE, a weight-space framework that uses embedding geometry and linear probing to track how a classifier's weights drift from zero-shot class embeddings toward spuriously correlated concepts. The authors show that the identified spurious correlations persist after full fine-tuning and transfer across diverse backbones: they degrade performance on ImageNet-1k and appear in medical notes (MIMIC-CXR), among other settings, and are validated in controlled experiments using generative models. BEE thus offers a general tool for diagnosing and naming spurious correlations, supporting principled dataset auditing and more trustworthy foundation models. The code is publicly released.

Abstract

Current methods for detecting spurious correlations rely on analyzing dataset statistics or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space, and to the embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Our experiments cover numerous datasets and domains: vision (Waterbirds, CelebA, ImageNet-1k), language (CivilComments, MIMIC-CXR medical notes), and multiple embedding families (CLIP, CLIP-DataComp.XL, mGTE, BLIP2, SigLIP2). BEE consistently exposes spurious correlations: from concepts that slash the ImageNet accuracy by up to 95%, to clinical shortcuts in MIMIC-CXR notes that induce dangerous false negatives. Together, these results position BEE as a general and principled tool for diagnosing spurious correlations in weight space, enabling principled dataset auditing and more trustworthy foundation models. The source code is publicly available at https://github.com/bit-ml/bee.
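To make the weight-drift idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes a concept is scored by how much more similar it is to the fine-tuned class weight than to the zero-shot class embedding; the abstract does not spell out BEE's actual scoring rule, so this cosine-difference score is an assumption. The function names, the toy 3-dimensional vectors, and the concept labels are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def drift_scores(w0, w, concept_embeddings):
    """Rank concepts by the similarity gained when moving from w0 to w.

    w0: zero-shot class embedding (initial classifier weight).
    w:  classifier weight after linear probing.
    The cosine-difference score is an illustrative assumption, not
    necessarily the scoring rule used by BEE.
    """
    scores = {
        name: cosine(w, emb) - cosine(w0, emb)
        for name, emb in concept_embeddings.items()
    }
    # Highest score first: concepts the weight drifted toward the most.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy example with hypothetical 3-d embeddings.
w0 = [1.0, 0.0, 0.0]
w = [0.8, 0.6, 0.0]
concepts = {
    "class_like": [1.0, 0.0, 0.0],     # aligned with the class itself
    "drift_aligned": [0.0, 1.0, 0.0],  # aligned with the drift direction
    "unrelated": [0.0, 0.0, 1.0],      # orthogonal to both weights
}
for name, score in drift_scores(w0, w, concepts):
    print(f"{name}: {score:+.3f}")
```

In this toy setting the concept lying along the drift direction ranks first, while the concept that merely restates the class scores lowest. A real pipeline would additionally filter out class-related concepts before ranking.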

Paper Structure

This paper contains 46 sections, 6 equations, 6 figures, 28 tables, 1 algorithm.

Figures (6)

  • Figure 1: Qualitative results with BEE for CLIP ViT-L/14 fine-tuned on ImageNet-1k. Although a (REAL) class is clearly depicted, adding an object tied to a spurious concept (SC) flips the prediction to a (PREDICTED) class absent from the image, leading to unexpected and unwanted behavior.
  • Figure 1: SC-enhanced zero-shot prompts. Following B2T, we inject SCs into zero-shot prompts, leveraging richer descriptions to improve classification. SCs identified by BEE significantly boost worst-group accuracy across image and text datasets.
  • Figure 2: Following BEE's steps for the ImageNet-1k "Fire Truck" class. In Step 1, during training, the classification weights $W$ drift from the initial class concept embedding $W^0$, outside the scope of relevant concepts, towards spuriously correlated ones. In Step 2, our method filters out class-related concepts and, using an embedding-space scoring system, ranks and automatically marks the highest-ranking class-neutral concepts as SCs.
  • Figure 3: The maximum distance between the reference line and the smoothed scores gives the threshold for our cut-off heuristic.
  • Figure 4: Correlation between per-sample loss and sample-to-bias similarity under ERM and GroupDRO (GDRO) after one epoch of training on Waterbirds, with the GroupDRO groups created using the B2T partitioning method. Under ERM, the loss is highly correlated with the biases. GroupDRO reduces these correlations, which suggests that the biases discovered by our method are closely related to the dataset's ground-truth groups and are used as shortcuts by the model unless mitigated.
  • ...and 1 more figure
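The cut-off heuristic described in the Figure 3 caption can be sketched as follows. This is one plausible reading of the caption, not the paper's exact procedure. It assumes that scores are sorted in descending order, smoothed with a centered moving average, and compared against a straight reference line joining the first and last smoothed points, with distance measured vertically. The window size and the function names are hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the boundaries."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def cutoff(scores, window=3):
    """Return (index, threshold) at the largest gap to the reference line.

    Assumed reading of the heuristic: the reference line joins the first
    and last smoothed scores, and the gap is the vertical distance.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(scores, reverse=True)
    smoothed = moving_average(ranked, window)
    n = len(smoothed)
    if n < 3:
        return 0, ranked[0]  # too few points to define a meaningful gap
    first, last = smoothed[0], smoothed[-1]
    gaps = [
        abs(first + (last - first) * i / (n - 1) - smoothed[i])
        for i in range(n)
    ]
    knee = max(range(n), key=gaps.__getitem__)
    return knee, ranked[knee]


# Toy scores: three clearly high values followed by a low tail.
scores = [0.9, 0.85, 0.8, 0.2, 0.15, 0.1, 0.05]
index, threshold = cutoff(scores, window=1)  # window=1 disables smoothing
selected = [s for s in scores if s > threshold]
print(index, threshold, selected)
```

Smoothing shifts the detected knee slightly, so the window size is a tuning choice. Concepts whose score exceeds the returned threshold would be the ones marked as spurious under this reading.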