Beyond Individual Input for Deep Anomaly Detection on Tabular Data

Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan

TL;DR

This work addresses tabular anomaly detection by introducing a reconstruction-based method that uses Non-Parametric Transformers (NPTs) to jointly model feature-feature and sample-sample dependencies. By processing masked feature reconstructions with a non-parametric inference pipeline that leverages the entire training set, the approach derives an anomaly score that reflects both inter-feature and inter-sample relations. The model achieves state-of-the-art results across 31 tabular datasets and is supported by ablation studies showing that combining both dependency types is crucial for performance; robustness to small training contamination is also demonstrated. While the method incurs higher computational costs due to its non-parametric nature, it offers a principled framework for exploiting rich dependency structures in tabular anomaly detection with strong empirical gains.

Abstract

Anomaly detection is vital in many domains, such as finance, healthcare, and cybersecurity. In this paper, we propose a novel deep anomaly detection method for tabular data that leverages Non-Parametric Transformers (NPTs), a model initially proposed for supervised tasks, to capture both feature-feature and sample-sample dependencies. In a reconstruction-based framework, we train an NPT to reconstruct masked features of normal samples. In a non-parametric fashion, we leverage the whole training set during inference and use the model's ability to reconstruct the masked features to generate an anomaly score. To the best of our knowledge, this is the first work to successfully combine feature-feature and sample-sample dependencies for anomaly detection on tabular datasets. Through extensive experiments on 31 benchmark tabular datasets, we demonstrate that our method achieves state-of-the-art performance, outperforming existing methods by 2.4% and 1.2% in terms of F1-score and AUROC, respectively. Our ablation study further proves that modeling both types of dependencies is crucial for anomaly detection on tabular data.

Paper Structure (42 sections, 21 equations, 4 figures, 14 tables, 1 algorithm)

Figures (4)

  • Figure 1: NPT-AD inference pipeline. In step (a), mask $j$ is applied to each validation sample. We construct a matrix $\mathbf{X}$ composed of the masked validation samples and the whole unmasked training set. In step (b), we feed $\mathbf{X}$ to the Non-Parametric Transformer (NPT), which tries to reconstruct the masked features of each validation sample. In addition to the learned feature-feature interactions, the NPT uses the unmasked training samples to reconstruct the masked features. In step (c), we compute the reconstruction error, which we later aggregate into the NPT-AD score.
  • Figure 2: For each of the $31$ datasets on which models were evaluated, we compute the average F1-score over $20$ runs, each with a different seed. Figure \ref{fig:sub_avg_F1} reports the average F1-score over all datasets for each tested model. Figure \ref{fig:sub_avg_rk} reports the average rank over the $31$ datasets. In both figures, the model on the far left is the worst-performing model for the chosen metric, and the model on the far right is the best-performing one. The metric of the best-performing model is highlighted in bold. See tables \ref{tab:odds_f1-1} and \ref{tab:odds_f1-2} in appendix \ref{appendix:add_rez} for the detailed metrics.
  • Figure 3: Impact of training set contamination on the F1-score and AUROC. Each model was trained $5$ times for each contamination share. The architecture used for NPT-AD is the same as for all experiments (see section \ref{sec:experiments}). The NPT and Transformer were trained for $100$ epochs with batch size equal to the dataset size, learning rate $0.01$, the LAMB optimizer \cite{lamb} with $\beta = (0.9, 0.999)$, per-feature embedding dimension $16$, $r$ set to $1$, and masking probability $p_{mask}=0.15$. NeuTraL-AD and GOAD were trained with the hyperparameters used for the thyroid dataset in their original papers, and the method of \cite{shenkar2022anomaly} was trained with the default parameters of its implementation.
  • Figure 4: While Mask-KNN only relies on sample-sample dependencies and the vanilla transformer only attends to feature-feature dependencies, NPT-AD combines both for anomaly detection.
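
The three steps of the inference pipeline in Figure 1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation. The function `knn_reconstruct` is a hypothetical stand-in for the trained NPT: it fills masked features from the nearest training rows, so it uses sample-sample information only (closer in spirit to the Mask-KNN baseline of Figure 4). The single-feature masks, the zero placeholder for masked entries, the squared-error loss, and the averaging over masks are likewise assumptions chosen for brevity.

```python
import numpy as np


def knn_reconstruct(X, n_val, mask, k=5):
    """Hypothetical stand-in for the trained NPT (not the paper's model).

    X stacks n_val masked validation rows on top of the unmasked training rows.
    Masked entries of each validation row are replaced by the mean of its k
    nearest training rows, with distances computed on visible features only.
    """
    val, train = X[:n_val], X[n_val:]
    visible = ~mask
    # Pairwise squared distances on visible features: shape (n_val, n_train).
    diff = val[:, visible][:, None, :] - train[:, visible][None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    nearest = np.argsort(dist, axis=1)[:, :k]
    recon = val.copy()
    recon[:, mask] = train[nearest][:, :, mask].mean(axis=1)
    return recon


def anomaly_scores(X_train, X_val, reconstruct=knn_reconstruct):
    """Average masked-feature reconstruction error over single-feature masks."""
    n_val, d = X_val.shape
    scores = np.zeros(n_val)
    for j in range(d):
        mask = np.zeros(d, dtype=bool)
        mask[j] = True  # assumption: one masked feature per mask
        # (a) Mask the validation samples and stack them on the training set.
        masked_val = X_val.copy()
        masked_val[:, mask] = 0.0  # placeholder value for masked entries
        X = np.vstack([masked_val, X_train])
        # (b) Reconstruct the masked features using the unmasked training rows.
        recon = reconstruct(X, n_val, mask)
        # (c) Accumulate the squared reconstruction error on the masked features.
        scores += ((recon[:, mask] - X_val[:, mask]) ** 2).sum(axis=1)
    return scores / d


# Toy data: feature 3 is the sum of features 0 and 1 for normal samples.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
X_train[:, 3] = X_train[:, 0] + X_train[:, 1]
normal = np.array([[0.5, -0.2, 1.0, 0.3]])
anomaly = np.array([[0.5, -0.2, 1.0, 5.0]])  # breaks the dependency
scores = anomaly_scores(X_train, np.vstack([normal, anomaly]))
```

In this toy setup the second sample violates the dependency between features, so its masked features are reconstructed poorly and it receives the higher score. Replacing `knn_reconstruct` with a model that also attends across features would correspond to the combination of both dependency types that the captions above describe.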