When Text and Images Don't Mix: Bias-Correcting Language-Image Similarity Scores for Anomaly Detection

Adam Goodge, Bryan Hooi, Wee Siong Ng

TL;DR

This paper investigates semantic anomaly detection with CLIP and uncovers a text clustering effect: text embeddings cluster together, away from image embeddings, which creates a similarity bias that degrades detection performance. It introduces BLISS, a simple, inference-only scoring scheme that corrects for this bias by combining an internal class score with an external text score derived from a broad dictionary. BLISS achieves state-of-the-art AUROC on CIFAR-10, CIFAR-100, and TinyImageNet across multiple splits, and is robust to dictionary size, backbone VLM, and few-shot normal data scenarios. The work highlights the need to understand and address the peculiarities of multi-modal latent spaces to ensure reliable downstream performance.

Abstract

Contrastive Language-Image Pre-training (CLIP) achieves remarkable performance in various downstream tasks through the alignment of image and text input embeddings and holds great promise for anomaly detection. However, our empirical experiments show that the embeddings of text inputs unexpectedly cluster tightly together, far away from image embeddings, contrary to the model's contrastive training objective to align image-text input pairs. We show that this phenomenon induces a 'similarity bias', in which false negative and false positive errors occur due to bias in the similarities between images and the normal label text embeddings. To address this bias, we propose a novel methodology called BLISS, which directly accounts for this similarity bias through the use of an auxiliary, external set of text inputs. BLISS is simple: it requires neither strong inductive biases about anomalous behaviour nor an expensive training process, and it significantly outperforms baseline methods on benchmark image datasets, even when access to normal data is extremely limited.
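
To make the idea of bias correction concrete, the following is a minimal sketch of how an internal class score and an external text score could be combined. It is an illustration under stated assumptions, not the BLISS formulation itself, since the exact score definitions are not given in this summary. The function name `bias_corrected_scores`, the `weight` parameter, the use of the maximum label similarity as the internal score, and the use of the mean dictionary similarity as the external score are all hypothetical choices. The embeddings are assumed to be precomputed by a CLIP-style encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so that dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def bias_corrected_scores(image_emb, label_emb, dict_emb, weight=1.0):
    """Illustrative bias-corrected normality score (higher = more normal).

    image_emb: (n_images, d) image embeddings
    label_emb: (n_labels, d) text embeddings of the normal class labels
    dict_emb:  (n_words, d)  text embeddings of an external dictionary
    weight:    hypothetical trade-off between the two terms (an assumption)
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    lab = l2_normalize(np.asarray(label_emb, dtype=float))
    dic = l2_normalize(np.asarray(dict_emb, dtype=float))

    # Internal score (assumed form): similarity to the closest normal label.
    internal = (img @ lab.T).max(axis=1)

    # External score (assumed form): average similarity to the dictionary,
    # used here as an estimate of how similar the image is to text in general.
    external = (img @ dic.T).mean(axis=1)

    # Subtracting the external term discounts images that are close to
    # all text, not just to the normal labels.
    return internal - weight * external, internal, external
```

In this sketch, an image that is similar to every text embedding receives a lower corrected score than its raw label similarity alone would suggest, which is the kind of correction the abstract describes.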

Paper Structure (19 sections, 11 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: (a) The naïve expectation of CLIP's latent space based on its contrastive objective. Text labels which are not semantically similar are separated from each other and the corresponding images cluster around them. (b) t-SNE projections of the true embeddings learnt by CLIP. Text labels (crosses) are highly clustered together, far away from images (dots).
  • Figure 2: Average cosine similarities of the normalized CLIP embeddings of CIFAR-10 text class labels to the dictionary (blue), and to their associated CIFAR-10 image embeddings (orange).
  • Figure 3: Proportion of false negative (blue) and false positive (orange) errors in each quantile of samples sorted by average similarity to the dictionary.
  • Figure 4: Scatter plot of internal class scores (x-axis) vs. external text scores (y-axis) for normal (blue) and anomaly (orange) samples.