VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction

Hiroto Nakata; Yawen Zou; Shunsuke Sakai; Shun Maeda; Chunzhi Gu; Yijin Wei; Shangce Gao; Chao Zhang

Abstract

Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. However, existing benchmarks rarely provide controlled settings where logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: https://github.com/nkthiroto/VID-AD.

Paper Structure (23 sections, 8 equations, 5 figures, 7 tables)

Introduction
Related Work
VAD Datasets
VAD Methods
VID-AD Dataset
Dataset Overview
Scenarios and Logical Anomaly Taxonomy
Capture Conditions for Vision-Induced Distraction
Benchmark Protocol and Dataset Statistics
Proposed Method
Problem Setting and Overview
Vision-to-Text Description and Negative Synthesis
Contrastive Fine-tuning of Text Encoder
Inference and Statistical Ensemble Scoring
Experiments
...and 8 more sections

Figures (5)

Figure 1: Example results from the Tools scenario in VID-AD. The top row (green border) shows a Normal Sample, and the bottom row (blue border) shows a Logical Anomaly where a screw is missing. From left to right, we present the Input Image, Anomaly Map, and Detected Anomaly, where red highlighted areas indicate regions classified as anomalous based on a detection threshold. EfficientAD (Batzner et al., 2024) produces strong anomaly responses even for the Normal Sample, resulting in false positives; it also fails to clearly localize the missing screw in the Logical Anomaly case.
Figure 2: Overview of VID-AD. Columns correspond to the 10 manufacturing scenarios (Sticks, Fruits, Tools, Cookies, Tapes, Stationery, Ropes, Blocks, Dishes, Balls), with two representative images shown for each scenario. The first five rows correspond to the five capture conditions: White BG (default), Cable BG, Mesh BG, Blurry CD, and Low-light CD (BG: background, CD: condition). The bottom row shows logical anomaly examples for each scenario, with red boxes highlighting the anomalous regions. Each scenario is defined by a pair of logical constraints, as indicated beneath each scenario name (abbreviations: Q = Quantity, L = Length, T = Type, P = Placement, R = Relation). Each cell in the first five rows represents a one-class benchmark task defined by a scenario and a capture condition, yielding 50 tasks in total (10 scenarios $\times$ five capture conditions).
Figure 3: Pipeline of the proposed unsupervised logical anomaly detector. For each image, a frozen Vision-Language Model (VLM) generates a single logic-focused text conditioned on a scenario-specific text prompt. During training, one contradictory negative text is synthesized from each positive text in a text-only manner, and a text encoder (BERT) is fine-tuned via contrastive learning, where dropout-augmented embeddings form the anchor–positive pair and synthesized texts serve as negatives. During inference, the test text is embedded by the fine-tuned encoder and compared with the set of training embeddings. The final normality score is computed by distance-based $k$-nearest-neighbor aggregation ($k=5$), where larger values indicate more normal samples.
Figure 4: Qualitative anomaly maps of CSAD on the Cookies scenario under five capture conditions. Columns correspond to White BG, Cable BG, Mesh BG, Low-light CD, and Blurry CD. Rows show, from top to bottom: a normal sample, its anomaly map, a logical anomaly, and its anomaly map. CSAD's anomaly responses vary substantially with the capture condition, often highlighting background patterns or low-level appearance variations (e.g., repetitive textures in Mesh BG or contrast degradation in Low-light CD) in addition to the relevant objects, illustrating the impact of vision-induced distraction on pixel-level localization.
Figure 5: Scenario-wise sensitivity of the proposed method across capture conditions on VID-AD. For each scenario, dots show the AUROC under each of the five capture conditions (see legend), the vertical bar indicates the min–max range, and the diamond marker denotes the mean across conditions.
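
The inference step described in the Figure 3 caption (comparing a test embedding with the training embeddings and aggregating over the $k=5$ nearest neighbors) can be illustrated with a minimal sketch. The caption does not specify the distance or the aggregation rule, so the choices here are assumptions made for illustration only: cosine distance, and the negated mean distance to the $k$ nearest training embeddings as the normality score. The function name `knn_normality_score` and the synthetic embeddings are hypothetical and do not come from the paper.

```python
import numpy as np

def knn_normality_score(test_emb, train_embs, k=5):
    """Assumed scoring rule: negated mean cosine distance from the test
    embedding to its k nearest training embeddings (larger = more normal)."""
    # L2-normalize so the dot product equals cosine similarity.
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb)
    dists = 1.0 - train @ test          # cosine distance to every training embedding
    k = min(k, len(dists))              # guard against very small training sets
    nearest = np.sort(dists)[:k]        # the k smallest distances
    return float(-nearest.mean())

# Synthetic stand-ins for text embeddings (hypothetical data, not VID-AD).
rng = np.random.default_rng(0)
center = np.zeros(16)
center[0] = 1.0
train_embs = center + 0.05 * rng.normal(size=(50, 16))   # "normal" training texts
normal_like = center + 0.05 * rng.normal(size=16)        # close to the training set
anomalous_like = -center + 0.05 * rng.normal(size=16)    # far from the training set

score_normal = knn_normality_score(normal_like, train_embs)
score_anomalous = knn_normality_score(anomalous_like, train_embs)
print(score_normal > score_anomalous)
```

Under these assumptions the score lies roughly in $[-2, 0]$, and a test text whose embedding is close to the normal training texts receives a higher score than one that is far from them, matching the convention in the caption that larger values indicate more normal samples.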
