MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects

Lei Fan, Dongdong Fan, Zhiguang Hu, Yiwen Ding, Donglin Di, Kai Yi, Maurice Pagnucco, Yang Song

TL;DR

MANTA addresses the lack of anomaly detection benchmarks for tiny objects. It introduces a large-scale, multi-view visual dataset (137K images across 38 categories in five domains) and two complementary text subsets, Declarative Knowledge (DeclK) and Constructivist Learning (ConsL), to support visual-text anomaly tasks. It establishes a five-view data acquisition pipeline, pixel-level anomaly annotations, and an evaluation framework with five training/testing settings, plus a BLIP-2 LoRA-based baseline for ConsL. Experiments reveal the challenges of tiny-object anomaly detection, show performance gains from multi-view data, and demonstrate both the potential and the current limits of text-driven anomaly reasoning. The dataset and benchmarks target practical anomaly detection for tiny objects in agriculture, medicine, electronics, mechanics, and groceries.

Abstract

We present MANTA, a visual-text anomaly detection dataset for tiny objects. The visual component comprises over 137.3K images across 38 object categories spanning five typical domains, of which 8.6K images are labeled as anomalous with pixel-level annotations. Each image is captured from five distinct viewpoints to ensure comprehensive object coverage. The text component consists of two subsets: Declarative Knowledge, including 875 words that describe common anomalies across various domains and specific categories, with detailed explanations for <what, why, how>, including causes and visual characteristics; and Constructivist Learning, providing 2K multiple-choice questions with varying levels of difficulty, each paired with images and corresponding answer explanations. We also propose a baseline for visual-text tasks and conduct extensive benchmarking experiments to evaluate advanced methods across different settings, highlighting the challenges and efficacy of our dataset.

Paper Structure

This paper contains 15 sections, 1 equation, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: The data distribution of the visual component in MANTA. It contains 137,338 multi-view images across 38 categories from five typical domains: agriculture, medicine, electronics, mechanics, and groceries. Each histogram represents a category, with different color groups indicating domains. For better visualization, a $\log_{10}$ scale is used on the y-axis.
  • Figure 2: Data acquisition. We collected tiny objects from various domains, constructed a prototype to capture visual information, and annotated them using CVAT and ChatGPT.
  • Figure 3: Examples illustrating the image data challenges.
  • Figure 4: Format and data distribution of Declarative Knowledge.
  • Figure 5: Overview of Constructivist Learning. We provide 2K MCQs with both easy and hard difficulty levels, each containing five options and explanations.
  • ...and 4 more figures