RobustA: Robust Anomaly Detection in Multimodal Data

Salem AlMarri, Muhammad Irzam Liaqat, Muhammad Zaigham Zaheer, Shah Nawaz, Karthik Nandakumar, Markus Schedl

TL;DR

This work targets the practical challenge of deploying multimodal anomaly detection systems under modality corruption. It introduces RobustA, a dataset with extensive audio and visual corruptions, and a robust method that learns audio and visual features in a shared space while dynamically weighting modalities during inference. The approach demonstrates superior robustness across corruption types and levels, including extreme missing data, and shows favorable zero-shot generalization. The findings imply significant real-world impact by enabling more reliable multimodal anomaly detection in adverse environments.

Abstract

In recent years, multimodal anomaly detection methods have demonstrated remarkable performance improvements over video-only models. However, real-world multimodal data is often corrupted due to unforeseen environmental distortions. In this paper, we present the first-of-its-kind work that comprehensively investigates the adverse effects of corrupted modalities on the multimodal anomaly detection task. To support this study, we propose RobustA, a carefully curated evaluation dataset for systematically observing the impact of audio and visual corruptions on the overall effectiveness of anomaly detection systems. Furthermore, we propose a multimodal anomaly detection method that shows notable resilience against corrupted modalities. The proposed method learns a shared representation space for different modalities and employs a dynamic weighting scheme during inference based on the estimated level of corruption. Our work represents a significant step toward the real-world application of multimodal anomaly detection, particularly in situations where modality corruption is likely to occur. The proposed evaluation dataset with corrupted modalities and the respective extracted features will be made publicly available.

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Existing multimodal anomaly detection approaches [wu2020not, zhou2024learning, tian2021weakly] are not robust when subjected to corruptions in either modality, including vision (e.g., fog and motion blur) or audio (e.g., babble).
  • Figure 2: Architecture of our approach. Modality-specific features are extracted using pre-trained audio and visual encoders. A linear projection is used to match the feature dimensions. Modality embeddings are independently mapped to learn representations in a shared space. During inference, the shared learning space helps mitigate the adverse effects of modality corruptions. Moreover, a dynamic weighting scheme is utilized to adjust the weights of the corrupted modality for better anomaly detection.
  • Figure 3: Examples of a few visual and audio corruptions demonstrating the challenging scenarios presented in our proposed benchmark.
  • Figure 4: Comparison of weighting schemes (average and dynamic) with the baseline concatenation approach [wu2020not] on two visual corruptions and one audio corruption.
  • Figure 5: Qualitative results of RobustA and the baseline on three videos taken from the XD-Violence dataset. Blue represents baseline anomaly scores, whereas green represents RobustA results. Dotted lines represent clean test samples, whereas solid lines represent corrupted samples. As seen, our approach generally outputs comparable anomaly scores for clean and corrupted inputs. The baseline, while generating competitive anomaly scores for clean samples, deteriorates when the input is corrupted. The red shaded area represents the anomaly ground truth.
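
The captions for Figures 2 and 4 mention two ways of combining modalities at inference: average weighting and dynamic weighting driven by an estimated corruption level. The sketch below illustrates that general idea only. The page does not specify how corruption is estimated or what weighting formula the paper uses, so the function names (`dynamic_weights`, `fuse_scores`), the assumption that corruption estimates lie in [0, 1], and the "reliability = 1 - corruption" rule are all hypothetical choices made for illustration, not the authors' method.

```python
from typing import List, Sequence, Tuple


def dynamic_weights(audio_corruption: float, visual_corruption: float) -> Tuple[float, float]:
    """Map per-modality corruption estimates to fusion weights that sum to 1.

    Assumption (not from the paper): each estimate lies in [0, 1], where 0
    means clean and 1 means fully corrupted or missing.
    """
    # Clamp to [0, 1], then treat reliability as the complement of corruption.
    audio_rel = 1.0 - min(max(audio_corruption, 0.0), 1.0)
    visual_rel = 1.0 - min(max(visual_corruption, 0.0), 1.0)
    total = audio_rel + visual_rel
    if total == 0.0:
        # Both modalities look fully corrupted: fall back to average weighting.
        return 0.5, 0.5
    return audio_rel / total, visual_rel / total


def fuse_scores(
    audio_scores: Sequence[float],
    visual_scores: Sequence[float],
    audio_corruption: float,
    visual_corruption: float,
) -> List[float]:
    """Combine per-segment anomaly scores from the two modalities."""
    w_audio, w_visual = dynamic_weights(audio_corruption, visual_corruption)
    return [w_audio * a + w_visual * v for a, v in zip(audio_scores, visual_scores)]
```

With equal corruption estimates, this reduces to average weighting; when one modality is estimated as fully corrupted, its scores are ignored and the other modality alone determines the output.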