SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

Tomoyasu Nanaumi, Yukino Tsuzuki, Junichi Okubo, Junichiro Fujii, Takayoshi Yamashita

Abstract

We study zero-shot anomaly detection and segmentation using frozen foundation-model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space; these coefficients generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two-stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only the guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixel-level AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.
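As a concrete illustration of the guide-based scoring rule (the same pipeline depicted in Figure 1), here is a minimal sketch of SPG-style inference. It assumes a frozen encoder producing patch tokens of shape (H, W, D), an SAE dictionary of shape (C, D), and learned non-negative guide coefficients of shape (C,); all names and shapes are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def spg_anomaly_map(patch_tokens, dictionary, alpha_normal, alpha_anomaly,
                    image_size):
    """Sketch of SPG-style scoring (shapes and names are assumptions).

    patch_tokens:  (H, W, D) frozen-encoder patch features
    dictionary:    (C, D) SAE dictionary (one atom per row)
    alpha_*:       (C,) non-negative sparse guide coefficients
    """
    # Guide vectors are sparse combinations of dictionary atoms.
    g_normal = alpha_normal @ dictionary       # (D,)
    g_anomaly = alpha_anomaly @ dictionary     # (D,)

    # Cosine similarity of every patch token to each guide.
    tokens = F.normalize(patch_tokens, dim=-1)                        # (H, W, D)
    guides = F.normalize(torch.stack([g_normal, g_anomaly]), dim=-1)  # (2, D)
    sims = tokens @ guides.T                                          # (H, W, 2)

    # Two-class softmax -> per-patch anomaly probability. The paper may
    # scale similarities by a temperature first; the sketch omits this.
    prob_anomaly = sims.softmax(dim=-1)[..., 1]                       # (H, W)

    # Upsample the patch-level map to image resolution.
    amap = F.interpolate(prob_anomaly[None, None], size=image_size,
                         mode="bilinear", align_corners=False)[0, 0]

    # Default image-level score: max pooling over the anomaly map.
    image_score = amap.max()
    return amap, image_score
```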

Figures (6)

  • Figure 1: Overview of SPG (Sparse-Projected Guides). Stage 1 trains a Sparse Autoencoder (SAE) on patch-token features extracted by a frozen visual encoder to learn a dictionary in which normal/anomaly guide vectors can be parameterized as sparse coefficients. Stage 2 freezes both the encoder and SAE, and learns only non-negative, sparsity-regularized guide coefficients that generate normal/anomaly guide vectors via the SAE dictionary. At inference, cosine similarities between patch tokens and the two guides are converted into an anomaly probability map via a two-class softmax, which is upsampled to the image resolution. Image-level anomaly scores are computed by aggregating the anomaly map (default: max pooling; see \ref{equ:lse_pooling} for the log-sum-exp generalization with temperature $\tau$).
  • Figure 2: Sensitivity to SAE hyperparameters. Heatmaps summarize cross-dataset performance when sweeping the SAE dictionary width $C$ and the TopK sparsity level $k$. The top row reports VisA $\rightarrow$ MVTec AD transfer and the bottom row reports MVTec AD $\rightarrow$ VisA transfer. Each column corresponds to image-level AUROC / AP and pixel-level AUROC / AUPRO, respectively, highlighting that the optimal $(C, k)$ can depend on both the transfer direction and the evaluation metric. A minimal TopK-SAE sketch follows this list.
  • Figure 3: Ablation of image-level anomaly-score aggregation via temperature-controlled log-sum-exp pooling. We vary the temperature $\tau$ in log-sum-exp pooling applied to the upsampled anomaly map to produce an image-level anomaly score. Curves report image-level AUROC and AP for both transfer directions (left: VisA $\rightarrow$ MVTec; right: MVTec $\rightarrow$ VisA). The figure compares the $\tau \rightarrow 0$ (max-like) regime against larger $\tau$ (mean-like) aggregation, showing that more max-like pooling better preserves localized high-confidence anomaly responses for detection. A sketch of this pooling rule also follows the figure list.
  • Figure 4: Effect of the visual backbone on SPG under cross-dataset transfer. We instantiate SPG with different frozen encoders and rerun Stage 1 (SAE training on auxiliary patch tokens) and Stage 2 (guide-coefficient learning) for each backbone under the same protocol. Results are reported for two transfer directions (left: VisA $\rightarrow$ MVTec; right: MVTec $\rightarrow$ VisA). Each plot reports image-level AUROC/AP and pixel-level AUROC/AUPRO, isolating how backbone choice influences both detection and segmentation quality within the same guide-based scoring rule.
  • Figure 5: Qualitative interpretation of SAE dictionary atoms emphasized by the learned anomaly guide. We select representative atoms from the anomaly guide’s active set by choosing those with the largest learned anomaly coefficients. Each atom is visualized by retrieving top-activating patches from the auxiliary dataset and by overlaying an upsampled activation heatmap on the corresponding images for spatial context. This qualitative analysis illustrates that SPG’s anomaly criterion can be traced to a sparse subset of SAE atoms, some of which appear broadly activated across categories while others are more category-biased.
  • ...and 1 more figure
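Figure 2 sweeps the dictionary width $C$ and the TopK sparsity level $k$ of the Stage-1 SAE. As a point of reference, below is a minimal sketch of a TopK sparse autoencoder forward pass on patch tokens; the layer names, ReLU placement, and bias choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Illustrative TopK sparse autoencoder (details are assumptions)."""

    def __init__(self, d_model: int, dict_width: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_width)
        # Columns of decoder.weight serve as the dictionary atoms in R^D.
        self.decoder = nn.Linear(dict_width, d_model, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (N, D) patch tokens. Encode, then keep only the k largest
        # activations per token (all others zeroed), enforcing sparsity.
        z = torch.relu(self.encoder(x))                                # (N, C)
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        x_hat = self.decoder(z_sparse)                                 # (N, D)
        return x_hat, z_sparse
```

In Stage 1 this module would be trained on auxiliary patch tokens with a reconstruction loss (e.g., MSE); Stage 2 freezes it and optimizes only the sparse guide coefficients over the same dictionary.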
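Figures 1 and 3 refer to a temperature-controlled log-sum-exp (LSE) generalization of max pooling for the image-level score. A minimal sketch, assuming the score is computed as $\tau \log \frac{1}{N}\sum_i \exp(s_i/\tau)$ over the $N$ pixels of the anomaly map; the exact normalization in \ref{equ:lse_pooling} may differ:

```python
import torch

def lse_pool(anomaly_map: torch.Tensor, tau: float) -> torch.Tensor:
    """Temperature-controlled log-sum-exp pooling (illustrative sketch).

    tau -> 0 recovers max pooling (max-like); large tau approaches mean
    pooling (mean-like), so sweeping tau interpolates between the two.
    """
    s = anomaly_map.flatten()
    n = torch.tensor(float(s.numel()))
    # tau * log( (1/N) * sum_i exp(s_i / tau) ), computed stably.
    return tau * (torch.logsumexp(s / tau, dim=0) - torch.log(n))
```

Figure 3's observation that small $\tau$ (max-like pooling) works better for detection is consistent with anomalies occupying few pixels: mean-like aggregation dilutes localized high-confidence responses.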