SAM-LAD: Segment Anything Model Meets Zero-Shot Logic Anomaly Detection

Yun Peng, Xiao Lin, Nachuan Ma, Jiayuan Du, Chuangwei Liu, Chengju Liu, Qijun Chen

TL;DR

SAM-LAD is a zero-shot, plug-and-play framework for detecting logical and structural anomalies in complex scenes. It combines SAM-based object segmentation with a scene-wide object matching and anomaly-estimation pipeline. It builds an offline template bank from normal samples using a pre-trained backbone (e.g., DINOv2), retrieves the $k$ nearest normal samples for a query, and generates per-object feature maps via FeatUp and SAM masks. Object descriptors are compacted with Dynamic Channel Graph Attention (DCGA) and matched by an Object Matching Model (OMM); anomalies are then quantified by the Anomaly Measurement Model (AMM) using per-patch multivariate Gaussian statistics and the Mahalanobis distance. The approach achieves state-of-the-art results on MVTec LOCO AD, MVTec AD, and DigitAnatomy while operating in a zero-shot regime with no task-specific retraining, which makes it practical to deploy across industrial and medical settings.
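The per-patch Gaussian scoring mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array layout `(N, P, C)`, and the regularization constant `eps` are assumptions made for illustration, and the summary does not specify which features the AMM actually receives.

```python
import numpy as np

def fit_patch_gaussians(normal_feats, eps=0.01):
    """Fit one multivariate Gaussian per patch position.

    normal_feats: (N, P, C) array, N normal samples, P patches, C channels.
    Returns the per-patch mean (P, C) and inverse covariance (P, C, C).
    eps is an assumed regularizer, not a value taken from the paper.
    """
    n, _, c = normal_feats.shape
    mean = normal_feats.mean(axis=0)                       # (P, C)
    centered = normal_feats - mean                         # (N, P, C)
    # Sample covariance for every patch position at once.
    cov = np.einsum("npc,npd->pcd", centered, centered) / max(n - 1, 1)
    # eps * I keeps each covariance invertible when N is small.
    cov = cov + eps * np.eye(c)
    return mean, np.linalg.inv(cov)

def mahalanobis_scores(query_feats, mean, cov_inv):
    """Mahalanobis distance of each query patch to its Gaussian.

    query_feats: (P, C). Returns a (P,) array of anomaly scores.
    """
    diff = query_feats - mean
    sq = np.einsum("pc,pcd,pd->p", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))
```

A patch whose features sit far from the normal distribution, relative to its covariance, receives a large score; a patch equal to the mean scores zero.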

Abstract

Visual anomaly detection is vital in real-world applications, such as industrial defect detection and medical diagnosis. However, most existing methods focus on local structural anomalies and fail to detect higher-level functional anomalies under logical conditions. Although recent studies have explored logical anomaly detection, they can only address simple anomalies such as missing or additional objects, and they generalize poorly because they are heavily data-driven. To fill this gap, we propose SAM-LAD, a zero-shot, plug-and-play framework for logical anomaly detection in any scene. First, we obtain a query image's feature map using a pre-trained backbone. Simultaneously, we retrieve the reference images and their corresponding feature maps via a nearest-neighbor search for the query image. Then, we introduce the Segment Anything Model (SAM) to obtain object masks of the query and reference images. Each object mask is multiplied with the entire image's feature map to obtain object feature maps. Next, an Object Matching Model (OMM) is proposed to match objects in the query and reference images. To facilitate object matching, we further propose a Dynamic Channel Graph Attention (DCGA) module, which treats each object as a keypoint and converts its feature maps into feature vectors. Finally, based on the object matching relations, an Anomaly Measurement Model (AMM) is proposed to detect objects with logical anomalies. Structural anomalies in the objects can also be detected. We validate SAM-LAD on several benchmarks, including the industrial datasets MVTec LOCO AD and MVTec AD and the logical dataset DigitAnatomy. Extensive experimental results demonstrate that SAM-LAD outperforms existing SoTA methods, particularly in detecting logical anomalies.
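Two steps of the abstract, reference retrieval and mask-based object features, can be sketched in a few lines of NumPy. This is a sketch under stated assumptions rather than the paper's code: `retrieve_references` and `object_feature_maps` are hypothetical names, the query is assumed to be summarized by a single global descriptor compared with Euclidean distance, and the masks are assumed to be binary arrays already resized to the feature-map resolution.

```python
import numpy as np

def retrieve_references(query_desc, bank_descs, k=3):
    """Return indices of the k template-bank entries nearest to the query.

    query_desc: (D,) global descriptor of the query image (assumed).
    bank_descs: (N, D) descriptors of the anomaly-free template bank.
    """
    dists = np.linalg.norm(bank_descs - query_desc, axis=1)
    return np.argsort(dists)[:k]

def object_feature_maps(feature_map, masks):
    """Multiply each object mask with the whole-image feature map.

    feature_map: (C, H, W) features of one image.
    masks: (M, H, W) binary masks, one per segmented object.
    Returns (M, C, H, W): features are kept inside each mask, zero outside.
    """
    masks = masks.astype(feature_map.dtype)
    return masks[:, None, :, :] * feature_map[None, :, :, :]
```

Broadcasting applies every mask to every channel in one operation, so each object ends up with its own feature map of the same spatial size as the original.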

Paper Structure (33 sections, 16 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: (a) Examples of structural anomalies. (b) Anomaly-free images of the breakfast box and screw bag categories. (c) Anomaly score maps of PatchCore, GLCF, and SAM-LAD for logical anomaly detection.
  • Figure 2: Pipeline of the proposed framework, which consists of four stages. The first stage is an offline operation that builds an anomaly-free template feature bank. The second stage extracts a feature map from a query image. The third stage uses SAM to obtain object feature maps. The last stage matches the objects in the query image with those in the reference images one by one. We calculate anomalies within each object and obtain the final anomaly score maps using the resulting matching relationships.
  • Figure 3: (a) A target object of the breakfast box category. (b) An unexpectedly over-detailed object mask.
  • Figure 4: Object masks produced by SAM for an anomaly-free image of each dataset category.
  • Figure 5: Architecture of the DCGA module.
  • ...and 9 more figures