Odd-One-Out: Anomaly Detection by Comparing with Neighbors

Ankan Bhunia, Changjian Li, Hakan Bilen

TL;DR

This work introduces Odd-One-Out, a scene-specific anomaly detection (AD) framework that identifies odd-looking objects by comparing multiple instances within the same scene using multi-view imagery. It builds a 3D object-centric feature volume from multiple views, enhances it via differentiable rendering and DINOv2 feature distillation, and performs cross-instance matching with sparse voxel attention to predict per-object anomalies and 3D locations. The approach is evaluated on two new benchmarks, ToysAD-8K and PartsAD-15K, and shows strong generalization, especially to unseen categories, outperforming reconstruction-based and multi-view baselines. By leveraging inter-object context and part-aware 3D representations, the work targets practical AD settings such as manufacturing quality control.

Abstract

This paper introduces a novel anomaly detection (AD) problem aimed at identifying 'odd-looking' objects within a scene by comparing them to other objects present. Unlike traditional AD benchmarks with fixed anomaly criteria, our task detects anomalies specific to each scene by inferring a reference group of regular objects. To address occlusions, we use multiple views of each scene as input, construct 3D object-centric models for each instance from the 2D views, and enhance these models with geometrically consistent part-aware representations. Anomalous objects are then detected through cross-instance comparison. We also introduce two new benchmarks, ToysAD-8K and PartsAD-15K, as testbeds for future research on this task. We provide a comprehensive quantitative and qualitative analysis of our method on these benchmarks.
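
The idea of cross-instance comparison can be illustrated with a minimal, non-learned sketch. It is not the method of the paper, which learns correlations among 3D object-centric feature volumes with sparse voxel attention. Here each object is assumed to be summarized by a small hand-written descriptor, and the helper names (`cosine`, `anomaly_scores`) and the scoring rule (one minus the mean cosine similarity to the other objects in the scene) are hypothetical choices made only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_scores(features):
    """Score each object by how dissimilar it is to its neighbors.

    Assumption: score = 1 - mean cosine similarity to all other objects
    in the same scene. This is a stand-in for learned matching.
    """
    scores = []
    for i, f_i in enumerate(features):
        sims = [cosine(f_i, f_j) for j, f_j in enumerate(features) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


# Hypothetical per-object descriptors for one scene: three similar
# objects and one that differs from the group.
features = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.2],
]

scores = anomaly_scores(features)
odd_index = max(range(len(scores)), key=scores.__getitem__)
print(odd_index, [round(s, 3) for s in scores])
```

The reference group is never given explicitly: the objects that agree with each other define what counts as regular in this scene, and the object with the highest score is flagged as the odd one out.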

Paper Structure (16 sections, 8 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: (a) We propose a new anomaly detection task focused on identifying 'odd-looking' objects relative to other instances within a scene. Inspired by real-world quality control in production environments, this task aims to detect subtle variations in geometry and texture, including defects like cracks and fractures, in a group of manufactured samples. (b) Our setting is scene-specific, requiring a comparison of object instances within the input scene, unlike the standard AD setting, which takes only a single object as input. (c) Our matching-based paradigm enables cross-category performance.
  • Figure 2: Overview of our framework. We extract features from a sequence of input views using a 2D CNN and back-project them into a 3D volume, which is then refined with a 3D CNN, resulting in $\bm{F}_v$. Next, we extract object-centric feature volumes $\{\bm{z}_n\}_{n=1}^{N}$, which are fed into the cross-instance matching module to learn correlations among objects using sparse voxel attention. To improve the 3D representation of the scene, we distill the knowledge of a 2D vision model, namely DINOv2, and integrate it into our 3D network via differentiable rendering. This helps to obtain a part-aware and geometrically consistent 3D feature representation.
  • Figure 3: AD qualitative results on the unseen test categories of ToysAD-8K (top row) and the test set of PartsAD-15K (bottom row) using our proposed framework. Due to limited space, two views are shown per scene. The model prediction is shown with a yellow bounding box: a 3D box for the first example (banana) and a projected box for the others for simplicity. Our model successfully predicts the correct object in all cases shown above.
  • Figure 4: Resolving occlusion and 3D ambiguity using multi-view images. The anomalous 'sheep' in the top row has a missing tail (visible only in the 2nd view due to occlusion), and the 'hammer' handle in the bottom row is bent (apparent only from the 2nd viewing angle due to 3D ambiguities).
  • Figure 5: Impact of the number of views (left) and object count (right) on model performance.
  • ...and 10 more figures
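
The back-projection step mentioned in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: cameras are assumed to be pinhole cameras that differ only by a translation and look along +z, features are sampled by nearest-neighbor lookup, and per-voxel features are averaged across the views in which the voxel is visible. The function names and the tiny feature maps are hypothetical.

```python
def project(point, focal, center):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return focal * x / z + center[0], focal * y / z + center[1]


def back_project(voxel_centers, views, focal, center):
    """Lift 2D features into a voxel volume.

    views: list of (feature_map, camera_position), where
    feature_map[row][col] is a list of channel values.
    Assumption: translation-only cameras and mean aggregation over views.
    Returns one feature per voxel, or None if no view observes the voxel.
    """
    volume = []
    for p in voxel_centers:
        acc, hits = None, 0
        for feature_map, cam_pos in views:
            # World frame -> camera frame (translation only, by assumption).
            pc = tuple(pi - ci for pi, ci in zip(p, cam_pos))
            if pc[2] <= 0:
                continue  # voxel lies behind this camera
            u, v = project(pc, focal, center)
            col, row = int(round(u)), int(round(v))
            if 0 <= row < len(feature_map) and 0 <= col < len(feature_map[0]):
                f = feature_map[row][col]
                acc = list(f) if acc is None else [a + b for a, b in zip(acc, f)]
                hits += 1
        volume.append([a / hits for a in acc] if hits else None)
    return volume


# Two hypothetical 3x3 single-channel feature maps.
map_a = [[[float(r * 3 + c)] for c in range(3)] for r in range(3)]
map_b = [[[10.0 + r * 3 + c] for c in range(3)] for r in range(3)]
views = [(map_a, (0.0, 0.0, 0.0)), (map_b, (1.0, 0.0, 0.0))]

voxels = [(0.0, 0.0, 1.0), (5.0, 0.0, 1.0)]
volume = back_project(voxels, views, focal=1.0, center=(1.0, 1.0))
print(volume)
```

In this toy setup the first voxel is seen by both cameras and receives the mean of the two sampled features, while the second voxel projects outside both images and stays empty. A real pipeline would use full camera poses, interpolated sampling, and a 3D CNN to refine the resulting volume, as the caption describes.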