Investigating the Semantic Robustness of CLIP-based Zero-Shot Anomaly Segmentation

Kevin Stangl, Marius Arvinte, Weilin Xu, Cory Cornelius

TL;DR

This work probes the semantic robustness of CLIP-based zero-shot anomaly segmentation (WinCLIP) by optimizing per-sample, bounded test-time perturbations—rotation, hue, and saturation shifts—to maximize a segmentation loss. By evaluating both uniform and per-sample lower bounds on MVTec and VisA with multiple CLIP backbones, the study uncovers consistent performance degradation (up to ~20% in pAUROC and ~40% in AUPRO), highlighting a significant robustness gap under distribution shifts. The authors provide a differentiable perturbation framework, outline optimization strategies, and show that color-based shifts pose greater challenges than rotations, particularly on harder datasets like VisA. The findings underscore the need for explicit lower-bound robustness evaluations and suggest directions for incorporating robust augmentations during training or tailoring test-time perturbations to object types to improve reliability in practical deployments.

Abstract

Zero-shot anomaly segmentation using pre-trained foundation models is a promising approach that enables effective algorithms without expensive, domain-specific training or fine-tuning. Ensuring that these methods work across various environmental conditions and are robust to distribution shifts is an open problem. We investigate the performance of the WinCLIP [14] zero-shot anomaly segmentation algorithm by perturbing test data using three semantic transformations: bounded angular rotations, bounded saturation shifts, and hue shifts. We empirically measure a lower performance bound by aggregating across per-sample worst-case perturbations and find that average performance drops by up to 20% in area under the ROC curve and 40% in area under the per-region overlap curve. We find that performance is consistently lowered on three CLIP backbones, regardless of model architecture or learning objective, demonstrating a need for careful performance evaluation.
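The difference between applying one shared perturbation to the whole test set and choosing the worst perturbation separately for each sample can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' procedure: it assumes a precomputed matrix of hypothetical per-sample scores (higher is better) over a discrete grid of candidate perturbations, whereas the paper optimizes a segmentation loss and reports dataset-level metrics. The function names and the score values are illustrative only.

```python
import numpy as np


def uniform_lower_bound(scores: np.ndarray) -> float:
    """One perturbation shared by all samples: min over candidates of the mean score."""
    # scores has shape (num_samples, num_candidate_perturbations)
    return float(scores.mean(axis=0).min())


def per_sample_lower_bound(scores: np.ndarray) -> float:
    """Each sample gets its own worst candidate: mean over samples of the per-sample min."""
    return float(scores.min(axis=1).mean())


# Hypothetical scores for 3 samples under 4 candidate perturbations (assumed values).
scores = np.array([
    [0.90, 0.80, 0.95, 0.85],
    [0.70, 0.90, 0.60, 0.88],
    [0.92, 0.75, 0.91, 0.93],
])

uniform = uniform_lower_bound(scores)
per_sample = per_sample_lower_bound(scores)
print(f"uniform bound:    {uniform:.4f}")
print(f"per-sample bound: {per_sample:.4f}")
```

Because the mean of per-sample minima can never exceed the minimum of the means, the per-sample bound in this simplified setting is always at least as pessimistic as the uniform one.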

Paper Structure (14 sections, 6 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Effects of the three augmentations applied to the same anomalous MVTec sample (a hazelnut with a large crack in its shell). The third column represents the original sample.
  • Figure 2: Zero-shot anomaly segmentation performance when the same rotation angle $\theta$ is applied to the MVTec test set for rotation-invariant objects.
  • Figure 3: Zero-shot anomaly segmentation performance when the same additive (modulo $2\pi$) hue shift $\delta_h$ is applied to the MVTec test set for rotation-invariant objects.
  • Figure 4: Zero-shot anomaly segmentation performance for two datasets (MVTec and VisA, left and right, respectively) using three CLIP backbones (ViT-B/16+, ViT-L/14, and adversarially fine-tuned ViT-L/14-FARE$^2$) for WinCLIP. The three test-time, worst-case semantic perturbations (angle, saturation, hue) are considered either separately or simultaneously (3D). The bars show the difference between the original test sets and the considered lower bounds.
  • Figure 5: Empirical distribution of per-sample worst-case rotation angles for all rotationally invariant objects in MVTec. The slight peak around the origin is caused by the optimization being sub-optimal due to the de-rotation of $\tilde{y}_\textrm{aug}$ using zero value extrapolation in the corners.
  • ...and 5 more figures
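
The additive hue shift modulo $2\pi$ mentioned in the Figure 3 caption can be sketched for a single RGB pixel as follows. This is a minimal illustration using Python's standard `colorsys` module, not the paper's differentiable implementation. The function name `shift_hue_saturation` is hypothetical, and treating the saturation shift as additive with clipping to $[0, 1]$ is an assumption, since the text above only states that the saturation shift is bounded.

```python
import colorsys
import math

TWO_PI = 2.0 * math.pi


def shift_hue_saturation(rgb, delta_h=0.0, delta_s=0.0):
    """Shift hue by delta_h radians (mod 2*pi) and saturation by delta_s (assumed additive)."""
    r, g, b = rgb  # channel values in [0, 1]
    h, s, v = colorsys.rgb_to_hsv(r, g, b)  # colorsys returns hue in [0, 1)
    # Convert hue to radians, add the shift, and wrap around the colour circle.
    h_rad = (h * TWO_PI + delta_h) % TWO_PI
    # Assumption: saturation shift is additive and clipped to the valid range.
    s_new = min(1.0, max(0.0, s + delta_s))
    return colorsys.hsv_to_rgb(h_rad / TWO_PI, s_new, v)


red = (1.0, 0.0, 0.0)
shifted = shift_hue_saturation(red, delta_h=TWO_PI / 3)  # a third of the circle: red to green
wrapped = shift_hue_saturation(red, delta_h=TWO_PI)      # a full turn leaves the colour unchanged
washed = shift_hue_saturation(red, delta_s=-1.0)         # zero saturation gives a grey value
print(shifted, wrapped, washed)
```

Wrapping the hue with the modulo operation keeps every shift valid, which is why the hue perturbation needs no explicit bound, while the saturation shift has to be clipped.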