Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection

Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano

Abstract

We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance, surpassing previous methods by wide margins.

Figures (10)

  • Figure 1: View-dependent Artefacts. The first column shows acquisition artefacts observed in an image (top row: specular highlights, red box) and a depth map (bottom row: missing depths, blue box) for two objects from SiM3D. As shown in the right column, other views of the same objects are not affected by artefacts at the positions (green boxes) corresponding to those highlighted in the left column.
  • Figure 2: ModMap Training. Starting from the set of images $I$ and depths $D$ of a training sample, we select a source view $s$ and a target view $t$, and forward their images $I^s, I^t$ and depths $D^s, D^t$ to the image encoder $\mathcal{E}_I$ and the depth encoder $\mathcal{E}_D$, respectively, to compute modality-specific features $F^s_I, F^t_I$ and $F^s_D, F^t_D$. Moreover, the one-hot encodings of the view indices, $v_s$ and $v_t$, are fed into the feature modulators $\Phi_I$ and $\Phi_D$ to generate modality-specific scale-and-shift parameters $\gamma_I, \beta_I$ and $\gamma_D, \beta_D$. Then, for both modalities, the source features are scaled and shifted to obtain modulated source features that incorporate view conditioning, $F^{s \rightarrow t}_I, F^{s \rightarrow t}_D$. The modulated features are passed to the mapping networks $\mathcal{M}_{I \rightarrow D}, \mathcal{M}_{D \rightarrow I}$, which predict the corresponding features of the other modality, $\hat{F}^{s \rightarrow t}_{D}, \hat{F}^{s \rightarrow t}_{I}$. The predicted features are then compared with the actual target features $F^t_D, F^t_I$ to optimise both the modulators and the mapping networks (a code sketch of this training step is given after this list).
  • Figure 3: ModMap Inference. We process the set of $N$ images $I^i$ and $N$ depths $D^i$, obtaining $N \times N$ anomaly maps for each of the two modalities. We ensemble the anomaly scores into $N$ refined 2D anomaly maps per modality. Finally, we aggregate the 2D anomaly maps to obtain a 3D Anomaly Volume and an Instance-level Anomaly Score (a sketch of this ensembling and aggregation is given after this list).
  • Figure 4: Rationale for ensembling. Squares represent features (green: uncorrupted, red: corrupted by image artefacts, blue: corrupted by depth artefacts, grey: incorrect prediction), while arrows show mapping predictions.
  • Figure 5: Qualitative results. Real-to-real (top) vs. Synthetic-to-real (bottom). Anomalies are highlighted by red boxes.
  • ...and 5 more figures
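
As a reading aid for Figure 2, below is a minimal sketch of one training step in plain PyTorch. It assumes patch-level features of shape (P, C) per view, one-hot view encodings, FiLM-style MLP modulators, and a cosine feature-matching loss; layer sizes, the loss, and the optimiser are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of one ModMap training step (Figure 2), for a single
# (source, target) view pair and one modality direction each way.
# Modulator architecture and the feature-matching loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureModulator(nn.Module):
    """Maps the one-hot (source, target) view encodings to per-channel
    scale (gamma) and shift (beta) parameters."""
    def __init__(self, num_views: int, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * num_views, 256), nn.ReLU(),
            nn.Linear(256, 2 * feat_dim),
        )

    def forward(self, v_s: torch.Tensor, v_t: torch.Tensor):
        gamma, beta = self.net(torch.cat([v_s, v_t], dim=-1)).chunk(2, dim=-1)
        return gamma, beta

class MappingNetwork(nn.Module):
    """Predicts features of the other modality from modulated source features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

def training_step(F_I_s, F_D_s, F_I_t, F_D_t, v_s, v_t,
                  phi_I, phi_D, map_I2D, map_D2I):
    # Features are (P, C) tensors; v_s, v_t are one-hot (num_views,) vectors.
    # View conditioning: scale-and-shift the source features of each modality.
    gamma_I, beta_I = phi_I(v_s, v_t)
    gamma_D, beta_D = phi_D(v_s, v_t)
    F_I_s2t = gamma_I * F_I_s + beta_I      # modulated image features
    F_D_s2t = gamma_D * F_D_s + beta_D      # modulated depth features

    # Crossmodal mapping: predict the target features of the other modality.
    F_D_pred = map_I2D(F_I_s2t)
    F_I_pred = map_D2I(F_D_s2t)

    # Compare predictions with the actual target features (cosine loss assumed).
    loss = (1 - F.cosine_similarity(F_D_pred, F_D_t, dim=-1)).mean() + \
           (1 - F.cosine_similarity(F_I_pred, F_I_t, dim=-1)).mean()
    return loss

In this sketch the modulators and mapping networks are the only learned components; repeating the step over all ordered (source, target) view pairs corresponds to the cross-view training strategy mentioned in the abstract.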
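
Similarly, for Figure 3, the following sketch illustrates how the N x N per-pair anomaly maps of one modality could be ensembled into N refined maps and reduced to an instance-level score. The cosine-distance map, the median over source views, the equal-weight fusion of modalities, and the max reduction are assumptions made for illustration; the actual ensembling rule and the 3D aggregation into the Anomaly Volume are not detailed in this excerpt.

# Illustrative inference-time ensembling (Figure 3); reduction rules assumed.
import torch

def anomaly_maps(pred_feats, obs_feats):
    """pred_feats, obs_feats: (N, N, H, W, C) predicted and observed target
    features for every (source, target) view pair of one modality.
    Returns (N, N, H, W) per-pair anomaly maps (cosine distance assumed)."""
    cos = torch.nn.functional.cosine_similarity(pred_feats, obs_feats, dim=-1)
    return 1.0 - cos

def ensemble_views(pair_maps):
    """Collapse the N x N per-pair maps into N refined 2D anomaly maps,
    one per target view (median over source views assumed)."""
    return pair_maps.median(dim=0).values          # (N, H, W)

def instance_score(refined_maps_img, refined_maps_depth):
    """Fuse the two modalities and reduce to a single instance-level score
    (mean over modalities, max over pixels and views assumed)."""
    fused = 0.5 * (refined_maps_img + refined_maps_depth)
    return fused.amax()

The max over pixels and views used for the instance-level score is a common placeholder choice in anomaly detection, not a detail taken from the paper.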