Revisiting Multimodal Fusion for 3D Anomaly Detection from an Architectural Perspective

Kaifang Long, Guoyang Xie, Lianbo Ma, Jiaqi Liu, Zhichao Lu

TL;DR

The paper investigates how multimodal fusion topology affects 3D anomaly detection, showing that intra- and inter-module fusion designs significantly influence performance. It introduces 3D-ADNAS, a two-level NAS framework that jointly searches modality-specific modules and fusion strategies to optimize 3D-AD, achieving improvements in accuracy, speed, and memory across Eyecandies and MVTec 3D-AD, and showing potential for few-shot tasks. The approach combines theoretical DST-based insights with empirical NAS-driven architecture search to reveal a practical path toward architecture-aware 3D-AD. This work highlights the practical impact of fusion architecture design for industrial-quality 3D anomaly detection and offers a scalable methodology for deploying efficient multimodal systems.

Abstract

Existing efforts to boost multimodal fusion of 3D anomaly detection (3D-AD) primarily concentrate on devising more effective multimodal fusion strategies. However, little attention has been devoted to analyzing the role of multimodal fusion architecture (topology) design in contributing to 3D-AD. In this paper, we aim to bridge this gap and present a systematic study on the impact of multimodal fusion architecture design on 3D-AD. This work considers the multimodal fusion architecture design at the intra-module fusion level, i.e., independent modality-specific modules, involving early, middle or late multimodal features with specific fusion operations, and also at the inter-module fusion level, i.e., the strategies to fuse those modules. In both cases, we first derive insights by theoretically and experimentally exploring how architectural designs influence 3D-AD. Then, we extend the SOTA neural architecture search (NAS) paradigm and propose 3D-ADNAS to simultaneously search across multimodal fusion strategies and modality-specific modules for the first time. Extensive experiments show that 3D-ADNAS obtains consistent improvements in 3D-AD across various model capacities in terms of accuracy, frame rate, and memory usage, and it exhibits great potential in dealing with few-shot 3D-AD tasks.
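To make the two-level design described in the abstract concrete, the sketch below models a toy search space in Python: one fusion operation is chosen per modality-specific module (intra-module level), and one strategy is chosen for combining the modules (inter-module level). It is only an illustration under stated assumptions. The candidate lists `FUSION_OPS` and `INTER_STRATEGIES`, the class names, and the random sampler are all hypothetical and are not taken from the paper, whose actual operations and search algorithm are not given here.

```python
import random
from dataclasses import dataclass

# Hypothetical candidate sets (assumptions for illustration only).
FEATURE_STAGES = ("early", "middle", "late")
FUSION_OPS = ("sum", "concat", "attention")       # assumed intra-module operations
INTER_STRATEGIES = ("sum", "concat", "weighted")  # assumed inter-module strategies


@dataclass(frozen=True)
class ModuleConfig:
    stage: str  # feature stage fused by this modality-specific module
    op: str     # fusion operation used inside the module


@dataclass(frozen=True)
class FusionArchitecture:
    modules: tuple  # one ModuleConfig per stage (intra-module level)
    strategy: str   # how module outputs are combined (inter-module level)


def sample_architecture(rng):
    """Draw one candidate architecture uniformly at random from the toy space."""
    modules = tuple(
        ModuleConfig(stage, rng.choice(FUSION_OPS)) for stage in FEATURE_STAGES
    )
    return FusionArchitecture(modules, rng.choice(INTER_STRATEGIES))


def search_space_size():
    """Count the distinct architectures in the toy space."""
    return len(FUSION_OPS) ** len(FEATURE_STAGES) * len(INTER_STRATEGIES)
```

Calling `sample_architecture(random.Random(0))` returns one candidate with three module configurations and one inter-module strategy. Even this toy space grows multiplicatively with each added choice, which suggests why a search procedure might be preferred over manual enumeration.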

Paper Structure

This paper contains 23 sections, 7 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (Left) The impact of multimodal fusion architecture design on 3D-AD performance. This shows the distribution of 3D-AD performance with variations at the intra- and inter-module fusion levels. (Right) 3D-ADNAS vs. SOTA methods in terms of accuracy, FPS, and memory usage.
  • Figure 2: The overall framework of 3D-ADNAS. The MFN architecture design is the core of this work and is specified as a two-level search space: at the intra-module fusion level, the early fusion cell (EAFC), middle fusion cell (MIFC), and late fusion cell (LAFC) are configured to determine the optimal combination of involved features and operations; at the inter-module fusion level, the search seeks the best fusion strategy to combine those modules (MSMs).
  • Figure 3: The inner structure of an MSM, where the early MSM is used as an example with $\mathbb{K}=2$. Note that the three types of MSMs share a similar structure.
  • Figure 4: The impact of multimodal fusion architecture design on 3D-AD performance. Zoom in for details.
  • Figure 5: Visualization results on MVTec 3D-AD.