Retrieval Augmented Deep Anomaly Detection for Tabular Data

Hugo Thimonier; Fabrice Popineau; Arpad Rimmel; Bich-Liên Doan

TL;DR

This paper tackles anomaly detection on tabular data by introducing a retrieval-augmented reconstruction framework that uses a transformer to reconstruct masked normal samples while leveraging external retrieval modules to incorporate sample-sample dependencies. The authors compare KNN-based and three attention-based retrieval variants, finding that attention-bsim markedly improves performance over a vanilla transformer across a large tabular benchmark, with average gains of about $+4.3\%$ in F1 and $+1.2\%$ in AUROC. They show that combining feature-feature and sample-sample dependencies yields better anomaly coverage across types and datasets, and they provide guidance on hyperparameters (e.g., $k$, $\lambda$) and retrieval module placement. The approach offers a flexible, plug-in augmentation that can enhance existing deep AD methods for tabular data and suggests broader applicability of retrieval mechanisms in anomaly detection tasks.

Abstract

Deep learning for tabular data has garnered increasing attention in recent years, yet employing deep models for structured data remains challenging. While these models excel with unstructured data, their efficacy with structured data has been limited. Recent research has introduced retrieval-augmented models to address this gap, demonstrating promising results in supervised tasks such as classification and regression. In this work, we investigate using retrieval-augmented models for anomaly detection on tabular data. We propose a reconstruction-based approach in which a transformer model learns to reconstruct masked features of \textit{normal} samples. We test the effectiveness of KNN-based and attention-based modules at selecting relevant samples to help reconstruct the target sample. Our experiments on a benchmark of 31 tabular datasets reveal that augmenting this reconstruction-based anomaly detection (AD) method with sample-sample dependencies via retrieval modules significantly boosts performance. The present work supports the idea that retrieval modules are useful for augmenting any deep AD method to enhance anomaly detection on tabular data.
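The pipeline described above (mask features, retrieve relevant samples, aggregate, reconstruct, score by reconstruction error) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the convex-combination form of the aggregation, the zero-masking, the retrieval on visible features in raw feature space, and the identity stand-in for the transformer are all assumptions made for illustration. The only detail taken from the source is that $\lambda=0$ corresponds to using no retrieval module.

```python
import numpy as np

def knn_retrieve(query, candidates, visible, k):
    """Return the k candidates closest to `query`, comparing visible features only."""
    diffs = (candidates - query)[:, visible]
    dists = np.linalg.norm(diffs, axis=1)
    return candidates[np.argsort(dists)[:k]]

def aggregate(h_target, h_retrieved, lam):
    """Assumed aggregation: convex mix of the target and the mean of retrieved samples.
    With lam = 0 the retrieved samples are ignored (no retrieval)."""
    return (1.0 - lam) * h_target + lam * h_retrieved.mean(axis=0)

def anomaly_score(z, train, masks, reconstruct, k=5, lam=0.5):
    """Sum of squared reconstruction errors on the masked features, over all masks."""
    total = 0.0
    for mask in masks:
        z_in = np.where(mask, 0.0, z)            # hide the masked features
        neighbours = knn_retrieve(z_in, train, ~mask, k)
        h = aggregate(z_in, neighbours, lam)
        z_hat = reconstruct(h, mask)             # stand-in for the transformer
        total += float(np.sum((z_hat[mask] - z[mask]) ** 2))
    return total

# Toy "normal" data: two clusters in which both features share the same sign.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 1.0], [-1.0, -1.0]])
train = centers[rng.integers(0, 2, size=200)] + rng.normal(0.0, 0.05, size=(200, 2))

masks = np.eye(2, dtype=bool)                    # mask one feature at a time
identity = lambda h, mask: h                     # hypothetical reconstructor

score_normal = anomaly_score(np.array([1.0, 1.0]), train, masks, identity)
score_anomaly = anomaly_score(np.array([1.0, -1.0]), train, masks, identity)
print(score_normal, score_anomaly)
```

In this toy setting the sample whose features break the shared-sign pattern receives the larger score, because its retrieved neighbours suggest a value for each masked feature that disagrees with the observed one. In the paper, a trained transformer plays the role of `reconstruct`, and the attention-based modules replace `knn_retrieve`.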

Paper Structure (43 sections, 15 equations, 3 figures, 9 tables)

Introduction
Related works
Density estimation
Reconstruction-based methods
One-Class Classification
Self-Supervised Approaches
Retrieval modules
Method
Learning Objective
Mask Reconstruction
Retrieval methods
KNN-based module
Attention-based modules
Aggregation
Anomaly score
...and 28 more sections

Figures (3)

Figure 1: Forward pass for sample $\mathbf{z}$; see the section on the training procedure for more detail. When no retrieval module is used, the prediction for a sample $\mathbf{z}$ consists of the upper part of the figure with $\lambda=0$.
Figure 2: For each of the $31$ datasets on which models were evaluated, we report the average F1-score over $20$ runs for $20$ different seeds. We refer readers to Thimonier et al. (2024) for details on the obtained metrics and the hyperparameters used for each method. For both figures, the model displayed on the far left is the worst-performing model for the chosen metric, and the one on the far right is the best-performing model. We also highlight the metric of the best-performing model in bold.
Figure 3: Anomalies of type $1$ require inter-sample dependencies to be correctly identified with high probability. Anomalies of type $2$, on the other hand, require inter-feature dependencies to be correctly identified.
