Multi-View Industrial Anomaly Detection with Epipolar Constrained Cross-View Fusion

Yifan Liu, Xun Xu, Shijie Li, Jingyi Liao, Xulei Yang

TL;DR

This work tackles multi-view industrial anomaly detection by embedding geometric priors into cross-view fusion. It introduces an Epipolar Attention Module (EAM) that constrains cross-view attention along epipolar lines, and a multi-center pre-training (MCP) strategy with per-view memory banks and synthetic negative samples to stabilize learning. The combination yields a memory-bank–based, geometry-aware framework (MVEAD) that outperforms state-of-the-art methods on Real-IAD in both sample- and image-level metrics, especially under multi-class settings. The approach offers practical benefits for real-world inspection pipelines by improving robustness and efficiency in multi-view anomaly localization.

Abstract

Multi-camera systems provide richer contextual information for industrial anomaly detection. However, traditional methods process each view independently, disregarding the complementary information across viewpoints. Existing multi-view anomaly detection approaches typically employ data-driven cross-view attention for feature fusion but fail to leverage the unique geometric properties of multi-camera setups. In this work, we introduce an epipolar geometry-constrained attention module to guide cross-view fusion, ensuring more effective information aggregation. To further enhance the potential of cross-view attention, we propose a pretraining strategy inspired by memory bank-based anomaly detection. This approach encourages normal feature representations to form multiple local clusters and incorporates multi-view-aware negative sample synthesis to regularize pretraining. We demonstrate that our epipolar-guided multi-view anomaly detection framework outperforms existing methods on the state-of-the-art multi-view anomaly detection dataset.
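The pretraining strategy is described as inspired by memory bank-based anomaly detection. As a rough illustration of that general idea only (not the paper's multi-center objective), the sketch below stores patch features from normal images and scores a test patch by its distance to the nearest stored feature. The function names, the Euclidean distance, and the max-pooling to an image score are assumptions made for this example.

```python
import numpy as np

def build_memory_bank(normal_feats):
    """Stack patch features from normal images into an (N, D) bank (hypothetical helper)."""
    return np.asarray(normal_feats, dtype=np.float64)

def patch_anomaly_scores(bank, patch_feats):
    """Score each test patch by its Euclidean distance to the nearest bank entry."""
    # Pairwise squared distances between M test patches and N bank entries: (M, N)
    diff = patch_feats[:, None, :] - bank[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1))

def image_anomaly_score(patch_scores):
    """Assumed pooling: the image score is the largest patch score."""
    return float(np.max(patch_scores))

if __name__ == "__main__":
    bank = build_memory_bank([[0.0, 0.0], [1.0, 0.0]])
    test_patches = np.array([[0.0, 0.0], [4.0, 3.0]])
    scores = patch_anomaly_scores(bank, test_patches)
    print(scores, image_anomaly_score(scores))
```

A patch that matches a stored normal feature scores zero, while a patch far from every stored feature scores high. A per-view variant would keep one such bank per camera view.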

Paper Structure

This paper contains 11 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An illustration of epipolar geometry-guided attention in a multi-camera system. The patch in the second view that corresponds to $P_{1i}$ lies on the epipolar line $l_{2i}=P^\top_{1i}F_{12}$ (red line). The attention map is restricted to the patches on the epipolar line.
  • Figure 2: Overview of the proposed multi-center pre-training framework. Multi-view images are processed by a frozen backbone, followed by cross-view fusion via the EAM module to construct a patch-level feature bank. Features from the reference view are randomly masked, while other views remain unchanged. In the EAM output, masked patches from the reference view and patches explicitly selected from source views via Eq. \ref{eq:neg_samp} serve as negatives, with the loss computed using Eq. \ref{eq:final_loss}.
  • Figure 3: After randomly selecting the support view, a random mask is applied. Following EAM processing, the Top-K patches along the epipolar line corresponding to the masked region in the reference view are selected based on Eq. \ref{eq:neg_samp} and used as negative patches along with the masked patch.
  • Figure 4: Anomaly segmentation results for two samples from the Real-IAD dataset: the audiojack class (left) and the USB class (right).
  • Figure 5: Computation overhead for different backbone sizes (DINOv2-small/base/large/giant) and center numbers $K$.
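The restriction described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' EAM implementation: it computes the epipolar line $l_{2i}=P^\top_{1i}F_{12}$ for a pixel in view 1, marks the view-2 patches whose centers lie within a pixel tolerance of that line, and runs single-query softmax attention over only those patches. The patch size, the tolerance `tol`, the use of patch centers, and both function names are assumptions introduced for this example.

```python
import numpy as np

def epipolar_patch_mask(F12, p1, grid_hw, patch, tol):
    """Boolean (h, w) mask of view-2 patches whose centers lie within
    `tol` pixels of the epipolar line of pixel p1 = (x, y) in view 1."""
    h, w = grid_hw
    # Line coefficients (a, b, c) as the row vector p1^T F12, following the caption
    a, b, c = np.array([p1[0], p1[1], 1.0]) @ F12
    ys, xs = np.mgrid[0:h, 0:w]
    cx = (xs + 0.5) * patch  # patch-center x coordinates in pixels
    cy = (ys + 0.5) * patch  # patch-center y coordinates in pixels
    # Point-to-line distance |a*x + b*y + c| / sqrt(a^2 + b^2)
    dist = np.abs(a * cx + b * cy + c) / np.hypot(a, b)
    return dist <= tol

def masked_attention(q, k, v, mask):
    """Softmax attention for one query over the patches where mask is True.
    Assumes at least one patch is selected."""
    scores = k @ q / np.sqrt(q.shape[0])
    scores = np.where(mask, scores, -np.inf)  # excluded patches get zero weight
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

if __name__ == "__main__":
    # Fundamental matrix of a pure horizontal camera shift: epipolar lines are rows
    F12 = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
    mask = epipolar_patch_mask(F12, p1=(24.0, 40.0), grid_hw=(4, 4), patch=16, tol=8.0)
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
    fused = masked_attention(q, k, v, mask.ravel())
    print(mask.astype(int))
    print(fused.shape)
```

In the toy setup the query pixel sits at height 40, so only the patch row centered at that height is attended to. With real calibration data, `F12` would come from the camera parameters, and the tolerance would need tuning to the patch size.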