Bridging Explainability and Embeddings: BEE Aware of Spuriousness
Cristian Daniel Păduraru, Antonio Bărbălau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu
TL;DR
This work tackles spurious correlations that are learned during fine-tuning yet escape detection by standard evaluation. It introduces BEE, a weight-space framework that uses embedding geometry and linear probing to track how a classifier's weights drift from zero-shot class embeddings toward spuriously correlated concepts. The authors show that the identified spurious correlations persist after full fine-tuning and transfer across diverse backbones: they degrade performance on ImageNet-1k and appear in medical notes (MIMIC-CXR), among other datasets, and are validated in controlled experiments using generative models. Overall, BEE provides a general, principled tool for diagnosing and naming spurious correlations, enabling dataset auditing and contributing to more trustworthy foundation models. The code is publicly released.
Abstract
Current methods for detecting spurious correlations rely on analyzing dataset statistics or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space and to the embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Our experiments cover numerous datasets and domains: vision (Waterbirds, CelebA, ImageNet-1k), language (CivilComments, MIMIC-CXR medical notes), and multiple embedding families (CLIP, CLIP-DataComp.XL, mGTE, BLIP2, SigLIP2). BEE consistently exposes spurious correlations, from concepts that slash ImageNet accuracy by up to 95% to clinical shortcuts in MIMIC-CXR notes that induce dangerous false negatives. Together, these results position BEE as a general and principled tool for diagnosing spurious correlations in weight space, enabling dataset auditing and more trustworthy foundation models. The source code is publicly available at https://github.com/bit-ml/bee.
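
To make the weight-space idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear probe whose class weight vector started at the zero-shot class embedding, and it ranks candidate concepts by the cosine similarity between their embeddings and the drift of that weight vector. The function name `rank_concepts_by_drift`, the scoring rule, the concept list, and the synthetic embeddings are all hypothetical choices made for illustration; a real use would substitute embeddings from a model such as CLIP.

```python
import numpy as np


def unit(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_concepts_by_drift(w_finetuned, w_zero_shot, concept_emb, concept_names):
    """Rank concepts by alignment with the weight drift (illustrative rule).

    Assumption: the drift direction (fine-tuned weights minus zero-shot
    weights) points toward concepts the probe has started to rely on.
    """
    drift = unit(w_finetuned - w_zero_shot)
    scores = unit(concept_emb) @ drift      # cosine similarity per concept
    order = np.argsort(-scores)             # highest similarity first
    return [(concept_names[i], float(scores[i])) for i in order]


# Synthetic stand-ins for real text embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim = 64
concept_names = ["water", "forest", "sky", "grass"]
concept_emb = unit(rng.normal(size=(len(concept_names), dim)))

# Zero-shot class weight, then a simulated probe that drifted toward "water".
w_zero_shot = unit(rng.normal(size=dim))
w_finetuned = w_zero_shot + 0.5 * concept_emb[0] + 0.01 * rng.normal(size=dim)

ranking = rank_concepts_by_drift(w_finetuned, w_zero_shot, concept_emb, concept_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the drift is constructed to point at one concept, so that concept ranks first by design. With real probes the ranking is only a list of candidates, which would still need validation such as the controlled experiments the abstract mentions.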
