View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis
Subin Varghese, Vedhus Hoskere
TL;DR
Scene AD addresses unsupervised pixel-level anomaly localization under unconstrained, multi-view, multi-object conditions. The authors propose OmniAD, a refinement of Reverse Distillation (RD) with a ResNeXt backbone and student attention that expands the effective receptive field (ERF), augmented by NeRF-based novel view synthesis (NVS) strategies (INV and QANV) to improve generalization across viewpoints. They introduce ToyCity, a real-image, multi-object, multi-view benchmark, and demonstrate that OmniAD with NVS augmentations substantially improves over baselines (e.g., a 64.33% relative gain in pixel-wise $F_1$ over RD without augmentation) and generalizes to MAD-Real and to fixed-view datasets such as MVTec-AD. The work provides the Scene AD task definition, the ToyCity benchmark, the view-synthesis augmentation methods, and the OmniAD model as a robust baseline for view-invariant anomaly detection in real-world scenes.
Abstract
The built environment, encompassing critical infrastructure such as bridges and buildings, requires diligent monitoring for unexpected anomalies or deviations from a normal state in captured imagery. Anomaly detection methods could aid in automating this task; however, deploying anomaly detection effectively in such environments presents significant challenges that have not been evaluated before. These challenges include varying camera viewpoints, the presence of multiple objects within a scene, and the absence of labeled anomaly data for training. To address these comprehensively, we introduce and formalize Scene Anomaly Detection (Scene AD) as the task of unsupervised, pixel-wise anomaly localization under these specific real-world conditions. Evaluating progress in Scene AD required the development of ToyCity, the first multi-object, multi-view real-image dataset for unsupervised anomaly detection. Our initial evaluations using ToyCity revealed that established anomaly detection baselines struggle to achieve robust pixel-level localization. To address this, we created two data augmentation strategies that generate additional synthetic images of non-anomalous regions to enhance generalizability. However, adding these synthetic images alone provided only minor improvements. We therefore developed OmniAD, a refinement of the Reverse Distillation methodology, to establish a stronger baseline. Our experiments demonstrate that OmniAD, when used with augmented views, yields a 64.33% increase in pixel-wise $F_1$ score over Reverse Distillation with no augmentation. Collectively, this work offers the Scene AD task definition, the ToyCity benchmark, the view synthesis augmentation approaches, and the OmniAD method. Project Page: https://drags99.github.io/OmniAD/
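The headline result is stated as a percentage increase in pixel-wise $F_1$. As a minimal sketch of how such a number can be read, the snippet below computes $F_1$ from a binary predicted mask and a ground-truth mask, then the relative gain of one score over another. It is not the authors' evaluation code: the function names, the assumption that the anomaly map has already been thresholded into a binary mask, the empty-mask convention, and all numbers in the usage example are hypothetical.

```python
import numpy as np


def pixelwise_f1(pred_mask, gt_mask):
    """F1 over pixels of two binary masks (True = anomalous pixel).

    Assumes the anomaly map was already thresholded; how the threshold
    is chosen is not specified here.
    """
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # anomalous pixels correctly flagged
    fp = np.logical_and(pred, ~gt).sum()   # normal pixels wrongly flagged
    fn = np.logical_and(~pred, gt).sum()   # anomalous pixels missed
    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: F1 is 0 when both masks are empty.
    return float(2 * tp / denom) if denom > 0 else 0.0


def relative_gain_percent(baseline, improved):
    """Relative increase of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Toy 4x4 example: the prediction is shifted one column from the truth.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 2:4] = True

f1 = pixelwise_f1(pred, gt)                # tp=2, fp=2, fn=2
gain = relative_gain_percent(0.30, 0.45)   # hypothetical F1 scores
print(f1, round(gain, 2))
```

In this reading, a 64.33% increase means the OmniAD score is about 1.64 times the Reverse Distillation score, not 64.33 percentage points higher.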
