Looking 3D: Anomaly Detection with 2D-3D Alignment

Ankan Bhunia, Changjian Li, Hakan Bilen

TL;DR

A novel transformer-based approach is proposed that explicitly learns the correspondence between the query image and reference 3D shape via feature alignment and leverages a customized attention mechanism for anomaly detection.

Abstract

Automatic anomaly detection based on visual cues holds practical significance in various domains, such as manufacturing and product quality assessment. This paper introduces a new conditional anomaly detection problem, which involves identifying anomalies in a query image by comparing it to a reference shape. To address this challenge, we have created a large dataset, BrokenChairs-180K, consisting of around 180K images, with diverse anomalies, geometries, and textures paired with 8,143 reference 3D shapes. To tackle this task, we have proposed a novel transformer-based approach that explicitly learns the correspondence between the query image and reference 3D shape via feature alignment and leverages a customized attention mechanism for anomaly detection. Our approach has been rigorously evaluated through comprehensive experiments, serving as a benchmark for future research in this domain.

Paper Structure

This paper contains 14 sections, 8 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: We propose a new conditional anomaly detection (AD) task that aims to identify and localize anomalies in a query image by comparing it to a reference shape. The anomalous region is marked with a yellow bounding box. For instance, the right leg of the blue sofa is rectangular, unlike the cylindrical leg in its reference shape.
  • Figure 2: Example anomaly instances from our BrokenChairs-180K dataset, which contains around 100K anomaly images. The top row shows example anomaly instances; the bottom row shows the corresponding ground-truth bounding boxes and segmentation masks. Red masks indicate anomalous parts, green contour lines outline the same regions before any anomaly was applied, and blue rectangles show the bounding boxes. (Figure best viewed zoomed in.)
  • Figure 3: Overall architecture of our proposed CMT framework for the conditional AD task. CMT takes two inputs: the query image $\bm{q}$ and the rendered multi-view images $\{\bm{v}_{n}\}_{n=1}^N$. We extract query features $\bm{f}^q$ and multi-view features $\bm{F}^v$ using the encoder $\varphi$. Additionally, we use 3D positional encoding (3DPE) to obtain 3D positional features $\bm{P}^v$ for the multi-view images. Next, $\bm{F}^v$ and $\bm{P}^v$ are concatenated and fed, along with the query features $\bm{f}^q$, to the correspondence-guided attention (CGA) network, denoted as $\phi$. The CGA network selectively conditions the final prediction on a small subset of the most related patches from the multi-view images through a top-$k$ sparse cross-attention (TKCA) mechanism. The view-agnostic local feature alignment (VLFA) aligns the encoder output features to achieve a view-agnostic representation through semi-supervised learning.
  • Figure 4: Our proposed correspondence-guided attention (CGA). The CGA comprises $B$ transformer-based blocks, each consisting of a standard self-attention module followed by a top-$k$ sparse cross-attention (TKCA) module.
  • Figure 5: Top-$k$ sparse attention-span visualization. For the query point (yellow), similarity heatmaps (first row) and top-$k$ attention-span (second row) across multiple views are shown.
  • ...and 3 more figures
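
The top-$k$ sparse cross-attention (TKCA) named in the captions for Figures 3 to 5 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `topk_sparse_cross_attention`, the single attention head, the absence of learned projections, and the flattening of all multi-view patches into one key/value set are assumptions made purely for illustration. Each query patch keeps only its `k` highest-scoring multi-view patches and applies softmax over those alone.

```python
import numpy as np


def topk_sparse_cross_attention(query, keys, values, k):
    """Illustrative single-head top-k sparse cross-attention (hypothetical helper).

    query:  (Q, d)   query-image patch features
    keys:   (M, d)   multi-view patch features, all views flattened together
    values: (M, d_v) features to aggregate
    k:      number of key patches each query patch may attend to (1 <= k <= M)
    Returns the attended features (Q, d_v) and the sparse weights (Q, M).
    """
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)  # scaled dot-product similarity, (Q, M)

    # Indices of the k largest scores in each row (order within the k is arbitrary).
    topk_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]

    # Keep the top-k scores and set everything else to -inf so softmax zeroes it out.
    masked = np.full_like(scores, -np.inf)
    kept = np.take_along_axis(scores, topk_idx, axis=-1)
    np.put_along_axis(masked, topk_idx, kept, axis=-1)

    # Numerically stable softmax over the surviving entries.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ values, weights


# Usage: 4 query patches, 3 views with 5 patches each, 8-dimensional features.
rng = np.random.default_rng(0)
q_feats = rng.standard_normal((4, 8))
view_feats = rng.standard_normal((3 * 5, 8))
attended, attn = topk_sparse_cross_attention(q_feats, view_feats, view_feats, k=3)
```

With `k` equal to the total number of multi-view patches, the sketch reduces to ordinary dense softmax cross-attention; smaller `k` restricts each query patch to a few candidate correspondences, which is the behavior the Figure 5 caption visualizes as the attention span.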