Retrieval Augmented Deep Anomaly Detection for Tabular Data
Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan
TL;DR
This paper tackles anomaly detection on tabular data by introducing a retrieval-augmented reconstruction framework: a transformer reconstructs masked features of normal samples, while external retrieval modules incorporate sample-sample dependencies. The authors compare a KNN-based retrieval module with three attention-based variants and find that attention-bsim markedly improves performance over a vanilla transformer across a benchmark of 31 tabular datasets, with average gains of about $+4.3\%$ in F1 and $+1.2\%$ in AUROC. They show that combining feature-feature and sample-sample dependencies yields better anomaly coverage across anomaly types and datasets, and they provide guidance on hyperparameters (e.g., $k$, $\lambda$) and on where to place the retrieval module. The approach offers a flexible, plug-in augmentation that can enhance existing deep AD methods for tabular data, and it suggests that retrieval mechanisms may be more broadly applicable in anomaly detection tasks.
Abstract
Deep learning for tabular data has garnered increasing attention in recent years, yet employing deep models for structured data remains challenging. While these models excel with unstructured data, their efficacy with structured data has been limited. Recent research has introduced retrieval-augmented models to address this gap, demonstrating promising results in supervised tasks such as classification and regression. In this work, we investigate using retrieval-augmented models for anomaly detection on tabular data. We propose a reconstruction-based approach in which a transformer model learns to reconstruct masked features of \textit{normal} samples. We test the effectiveness of KNN-based and attention-based modules for selecting relevant samples that help reconstruct the target sample. Our experiments on a benchmark of 31 tabular datasets reveal that augmenting this reconstruction-based anomaly detection (AD) method with sample-sample dependencies via retrieval modules significantly boosts performance. The present work supports the idea that retrieval modules are useful to augment any deep AD method to enhance anomaly detection on tabular data.
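
To make the idea of retrieval-assisted masked reconstruction concrete, the following is a minimal sketch and not the authors' model. It assumes that the transformer can be replaced, for illustration only, by a plain average over retrieved neighbours: each feature of a test sample is masked in turn, the $k$ nearest normal training samples are retrieved using the remaining features, and the masked value is predicted as the neighbours' mean. The function name `knn_masked_reconstruction_scores`, the toy data, and the squared-error score are all assumptions introduced here.

```python
import numpy as np

def knn_masked_reconstruction_scores(X_train, X_test, k=5):
    """Anomaly score = summed squared error of reconstructing each masked
    feature from the k nearest *normal* training rows (illustrative only)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    n_features = X_train.shape[1]
    scores = np.zeros(X_test.shape[0])
    for j in range(n_features):
        # Mask feature j: retrieval only sees the other features.
        keep = np.arange(n_features) != j
        # Pairwise squared distances, shape (n_test, n_train).
        diff = X_test[:, None, keep] - X_train[None, :, keep]
        dist = (diff ** 2).sum(axis=2)
        # Indices of the k closest training rows for each test row.
        nn_idx = np.argsort(dist, axis=1)[:, :k]
        # Reconstruct the masked feature as the neighbours' mean.
        recon = X_train[nn_idx, j].mean(axis=1)
        scores += (X_test[:, j] - recon) ** 2
    return scores

# Toy "normal" data in which the second feature is roughly twice the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X_train = np.column_stack([x1, 2 * x1 + 0.05 * rng.normal(size=200)])

X_test = np.array([
    [0.5, 1.0],   # consistent with the normal pattern
    [0.5, -1.0],  # violates the feature-feature relation
])
scores = knn_masked_reconstruction_scores(X_train, X_test, k=5)
print(scores)
```

In this toy setting the second test row should receive the larger score, because its masked features cannot be recovered from neighbouring normal samples. The paper's approach instead trains a transformer on the masked-reconstruction task and uses the retrieved samples to assist it; the neighbour average here only stands in for that learned component.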
