Kaputt: A Large-Scale Dataset for Visual Defect Detection

Sebastian Höfer, Dorian Henning, Artemij Amiranashvili, Douglas Morrison, Mariliza Tzes, Ingmar Posner, Marc Matvienko, Alessandro Rennola, Anton Milan

TL;DR

Kaputt presents a large-scale, diverse defect-detection dataset for retail logistics, addressing the gap between saturated manufacturing benchmarks and real-world, open-world item variation. The dataset includes 238,421 top-down RGB images of 48,376 items, with 29,316 defective instances across seven defect types and two severities. Each annotated query image comes with up to three unannotated reference images, which enables supervised, unsupervised, and hybrid methods. Systematic benchmarking across four scenarios shows that state-of-the-art methods struggle under heavy pose variation and limited defective samples: AUROC remains low for most approaches, and reference images sometimes even hinder performance. The results emphasize the need for robust, generalizable anomaly-detection techniques. Kaputt thus establishes a new benchmark and data resource for retail logistics defect detection, with practical impact on improving quality assurance and reducing waste in large-scale fulfillment operations.

Abstract

We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD [6] and VisA [33] have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of object pose and appearance. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (and more than 29,000 defective instances), it is 40 times larger than MVTec-AD and contains more than 48,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they do not surpass 56.96% AUROC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail logistics anomaly detection. The dataset is available for download under https://www.kaputt-dataset.com.
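To put the reported AUROC figures in context, the following minimal sketch computes AUROC from anomaly scores with the rank-based (pairwise) definition. It is an illustration only, not the paper's evaluation code: the function name `auroc` and the toy scores and labels are assumptions made for this example. An AUROC of 0.5 corresponds to a detector that cannot separate defective from normal samples, which is why a score such as 56.96% sits only slightly above chance.

```python
def auroc(scores, labels):
    """Rank-based AUROC (hypothetical helper, not from the paper).

    Returns the probability that a randomly chosen defective sample
    (label 1) receives a higher anomaly score than a randomly chosen
    normal sample (label 0). Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 3 normal and 2 defective samples.
labels = [0, 0, 0, 1, 1]

perfect_scores = [0.1, 0.2, 0.3, 0.8, 0.9]   # defects always score highest
constant_scores = [0.5, 0.5, 0.5, 0.5, 0.5]  # uninformative detector
mixed_scores = [0.1, 0.6, 0.3, 0.5, 0.9]     # one normal outranks one defect

print(auroc(perfect_scores, labels))   # 1.0
print(auroc(constant_scores, labels))  # 0.5
print(auroc(mixed_scores, labels))     # 5 of 6 pairs ordered correctly
```

The pairwise loop is quadratic in the number of samples, so for a dataset of this size a sorting-based implementation or a library routine would be the practical choice; the quadratic form is used here only because it mirrors the definition directly.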

Paper Structure

This paper contains 31 sections, 8 figures, 35 tables.

Figures (8)

  • Figure 1: Overview of defect severities and defect types. Our dataset categorizes defective samples into two severity classes: minor (top two rows) and major (bottom two rows). Additionally, each defective sample is assigned one or multiple defect types (columns), which characterize the defect(s) an item exhibits in a more fine-grained manner. The figure shows two representative samples per defect type/severity combination.
  • Figure 2: Each query image is associated with 1-3 reference images which may exhibit significant variability: (1) Benign case. (2) Defective reference image ($<1$% of all reference images). (3) Significant background variation, and $< 3$ reference images available. (4) Pose variability (front vs. back).
  • Figure 3: Examples of challenging defective cases (from left to right). (1) Unobservable cases. A small stripe in the bottom half of the CD could be either a reflection or a crack in the cover. (2) Complex cases. The detergent pack looks intact, but on a second look the powder on the tray next to the item indicates a spillage defect. (3) Ambiguous cases. The multi-pack is complete but its units are unordered, which is acceptable but has a different visual appearance than the corresponding reference image.
  • Figure 4: Left: Distribution of item material types and defect severities. We observe that items with cardboard material dominate the dataset, followed by plastic bags/cases and books. Right: Distribution of defect types per defect severity. We find that deformation is the most common defect type; however, it mostly results in minor defect severity, as do penetration, actuation and superficial. In contrast, deconstruction and spillage commonly result in major defect severity.
  • Figure 5: Overview of the defect types used to annotate defective samples (bold font) and related and colloquial characterization of the defect types (in bubbles). The proximity of the bubbles and their overlap indicates which defect types are similar/related.
  • ...and 3 more figures