Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors

Guangyao Zhai, Yue Zhou, Xinyan Deng, Lars Heckler, Nassir Navab, Benjamin Busam

TL;DR

This work introduces FoundAD, a few-shot, multi-class anomaly detector that leverages frozen foundation visual encoders and a lightweight nonlinear projector to map anomalous embeddings back onto the natural image manifold. By training with synthetic anomalies generated via a CutPaste-inspired module and operating entirely in latent space, FoundAD achieves strong detection and localization while using far fewer parameters than prior approaches. Extensive experiments on MVTec-AD and VisA show state-of-the-art performance across multiple encoders and few-shot settings, with notable efficiency gains. The findings suggest that foundation visual features alone can power robust anomaly detection, reducing reliance on task-specific models or text prompts and enabling practical industrial deployment.

Abstract

Few-shot anomaly detection streamlines industrial safety inspection. However, with only a few samples it is hard to separate normal from abnormal features accurately, and harder still under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, because the enormous quantity of data helps them learn the general distribution of normal images. We observe that the amount of anomalous content in an image correlates directly with the difference in the learnt embeddings, and we use this observation to design a few-shot anomaly detector termed FoundAD. FoundAD learns a nonlinear projection operator onto the natural image manifold. This simple operator serves as an effective tool for characterizing and identifying out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. Backed by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection.
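The observation that embedding difference grows with the amount of anomalous content can be probed with a simple measurement loop: paste progressively larger patches into a normal image, embed each result, and record the L2 distance to the embedding of the untouched image. The sketch below illustrates only that procedure. `toy_encoder` is a hypothetical stand-in (per-patch mean pooling), not a foundation encoder, and `cut_paste`, `PATCH`, the image size, and the paste locations are all assumptions made for illustration, so the printed numbers say nothing about the paper's results.

```python
import numpy as np

PATCH = 8  # hypothetical patch size of the stand-in encoder


def toy_encoder(img):
    """Stand-in encoder (assumption): one scalar per PATCH x PATCH patch, its mean."""
    h, w = img.shape
    return img.reshape(h // PATCH, PATCH, w // PATCH, PATCH).mean(axis=(1, 3))


def cut_paste(img, size, src=(32, 32), dst=(0, 0)):
    """CutPaste-style synthesis: copy a size x size square from src to dst."""
    out = img.copy()
    if size > 0:
        out[dst[0]:dst[0] + size, dst[1]:dst[1] + size] = \
            img[src[0]:src[0] + size, src[1]:src[1] + size]
    return out


rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))   # synthetic "normal" image
f_r = toy_encoder(image)            # reference embedding

sizes = [0, 8, 16, 32]              # side length of the pasted square
distances = [
    float(np.linalg.norm(toy_encoder(cut_paste(image, s)) - f_r))
    for s in sizes
]

for s, d in zip(sizes, distances):
    print(f"pasted area {s * s:5d} px -> feature distance {d:.3f}")
```

To repeat the measurement with a real encoder, one would replace `toy_encoder` with a frozen foundation model that returns patch embeddings and keep the rest of the loop unchanged.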

Paper Structure

This paper contains 35 sections, 3 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: Manifold Projection. Large training sets enable foundation models to learn the manifold of natural images (illustrated schematically as a 2D surface), which lies in a higher-dimensional feature space. Normal images such as $I_r$ are embedded onto this manifold. Images with anomalies ($I_s^1$, $I_s^2$) lie further away from this manifold. The distance $D\left( f_s^i, f_r \right)$ correlates with the number of anomalous pixels in the image. We learn a nonlinear projection operator $\phi$ that projects the embedding $f_a$ of an anomalous image $I_a$ onto its corresponding normal feature $f_a^\ast$. Comparing the two features enables few-shot anomaly detection and yields the heatmap $I_h$.
  • Figure 2: Correlation of Anomaly Area with Feature Distance. Two foundation encoders trained under different paradigms are shown. Upper left/right: Real images with corresponding synthetic and real anomalies. Lower left/right: Coloured PCA visualizations of their embedded features using SigLIP [zhai2023sigmoid] (left) and DINOv2 [Dinov2] (right). Center: L2 feature distance of embeddings for synthetic anomalies of increasing pixel area on a real image. A clear correlation is visible for both foundation models.
  • Figure 3: A. Training pipeline. Normal training images $I_r$ are first processed by the anomaly synthesis module to generate augmented samples $I_s$. Feature embeddings of the augmented image and the original image are extracted by the Anomaly-Aware Encoder $\theta_a$ and the Reference Encoder $\theta_b$, respectively ($\theta_a=\theta_b=\theta$). The Manifold Projector $\phi$ is trained to map the feature embeddings $f_s$ of the synthesized anomalous image towards the normal feature $f_r$. The training objective is to minimize the distance $D\left( f_r^\ast, f_r \right)$ between the projected feature $f_r^\ast$ and the reference feature $f_r$. B. Inference pipeline. During inference, an input image is processed by the Anomaly-Aware Encoder (AE) to extract feature embeddings $f_a$, which are then projected by the Manifold Projector to $f_a^\ast$. The anomaly score $D\left( f_a^\ast, f_a \right)$ is computed for each patch. We aggregate the Top-K highest patch-level anomaly scores and generate an anomaly heatmap $I_h$ by upsampling to the original image resolution.
  • Figure 4: Bubble chart of AUROC results across different methods, averaged over MVTec-AD and VisA from \ref{table:results_multi_class}. Smaller circles indicate fewer parameters.
  • Figure 5: Qualitative comparison with few-shot baselines in the 1-shot setting. We directly compare our results with the ones cropped from IIPAD [lv2025oneforall].
  • ...and 8 more figures
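
The inference pipeline described in the Figure 3 caption (per-patch score $D\left( f_a^\ast, f_a \right)$, Top-K aggregation, upsampling to a heatmap $I_h$) can be sketched roughly as follows. This is a minimal illustration under several assumptions not stated in the captions: $D$ is taken to be the L2 distance, the Top-K scores are aggregated by their mean, and upsampling is nearest-neighbour. The function names `patch_scores`, `image_score`, and `heatmap` are hypothetical, and the projector output is simulated by shifting one patch feature instead of running a trained $\phi$.

```python
import numpy as np


def patch_scores(f_a, f_a_proj):
    """Per-patch anomaly score: L2 distance (assumed choice of D) between
    the encoder features f_a and the projected features f_a_proj."""
    return np.linalg.norm(f_a_proj - f_a, axis=-1)


def image_score(scores, k):
    """Image-level score: mean (assumed aggregation) of the k largest patch scores."""
    top_k = np.sort(scores.ravel())[::-1][:k]
    return float(top_k.mean())


def heatmap(scores, out_hw):
    """Nearest-neighbour upsampling (assumption) of the patch grid to out_hw."""
    gh, gw = scores.shape
    out_h, out_w = out_hw
    rows = np.arange(out_h) * gh // out_h
    cols = np.arange(out_w) * gw // out_w
    return scores[np.ix_(rows, cols)]


# Simulated features on a 4 x 4 patch grid with 8 channels per patch.
rng = np.random.default_rng(0)
f_a = rng.normal(size=(4, 4, 8))

# Stand-in for the projector output: identical to f_a except at one patch,
# mimicking a projector that moves only the anomalous patch.
f_a_proj = f_a.copy()
f_a_proj[1, 2] += 3.0

scores = patch_scores(f_a, f_a_proj)   # shape (4, 4)
img_score = image_score(scores, k=2)   # scalar image-level score
I_h = heatmap(scores, (16, 16))        # shape (16, 16)

print("most anomalous patch:", np.unravel_index(scores.argmax(), scores.shape))
```

In a real setting, `f_a` would come from the frozen encoder and `f_a_proj` from the trained projector; a smoother interpolation could replace the nearest-neighbour upsampling without changing the rest of the sketch.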