Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Alastair Anderberg; James Bailey; Ricardo J. G. B. Campello; Michael E. Houle; Henrique O. Marques; Miloš Radovanović; Arthur Zimek

TL;DR

The paper tackles outlier detection under varying local intrinsic dimensionality by introducing DAO, a nonparametric, dimensionality-aware outlier scorer. DAO is grounded in Local Intrinsic Dimensionality (LID) theory and the asymptotic local density ratio (ALDR), which lets it adapt to local dataset geometry when assessing outlierness. Empirical results across more than 800 synthetic and real datasets show that DAO outperforms the traditional baselines LOF, SLOF, and kNN, particularly when LID varies significantly within the data; the work also evaluates how the choice of LID estimator affects performance. The findings suggest that incorporating local dimensionality leads to more robust and effective outlier detection in complex, high-dimensional settings. A public codebase supports replication and further research.

Abstract

We present a nonparametric method for outlier detection that takes full account of local variations in intrinsic dimensionality within the dataset. Using the theory of Local Intrinsic Dimensionality (LID), our 'dimensionality-aware' outlier detection method, DAO, is derived as an estimator of an asymptotic local expected density ratio involving the query point and a close neighbor drawn at random. The dimensionality-aware behavior of DAO is due to its use of local estimation of LID values in a theoretically-justified way. Through comprehensive experimentation on more than 800 synthetic and real datasets, we show that DAO significantly outperforms three popular and important benchmark outlier detection methods: Local Outlier Factor (LOF), Simplified LOF, and kNN.
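To make the ingredients named in the abstract concrete, here is a minimal sketch of the two simplest baselines (kNN distance and Simplified LOF) and of the maximum-likelihood (Hill-type) LID estimator, all in their commonly cited textbook forms. It does not reproduce the DAO score itself, whose exact formula is not given in this summary. The function names are hypothetical, the neighbor search is brute force, and the points are assumed to be distinct so that no neighbor distance is zero.

```python
import math

def knn(points, i, k):
    """k nearest neighbours of points[i] as ascending (distance, index) pairs."""
    d = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return sorted(d)[:k]

def knn_score(points, i, k):
    """kNN outlier score: distance from points[i] to its k-th nearest neighbour."""
    return knn(points, i, k)[-1][0]

def slof_score(points, i, k):
    """Simplified LOF: mean ratio of the query's k-distance to its neighbours' k-distances."""
    kd_q = knn_score(points, i, k)
    return sum(kd_q / knn_score(points, j, k) for _, j in knn(points, i, k)) / k

def lid_mle(dists):
    """MLE (Hill-type) LID estimate from ascending, strictly positive neighbour distances."""
    w = dists[-1]  # largest distance acts as the neighbourhood radius
    mean_log = sum(math.log(d / w) for d in dists) / len(dists)
    # All distances equal gives mean_log == 0; report infinity instead of dividing by zero.
    return float("inf") if mean_log == 0 else -1.0 / mean_log

# Toy data (illustrative only): four points on a unit square plus one distant point.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
scores = [slof_score(pts, i, k=2) for i in range(len(pts))]
```

On the toy data the distant point receives the largest kNN and SLOF scores. Both baselines compare distances directly and ignore local dimensionality, which is the gap a dimensionality-aware criterion is meant to address.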

Paper Structure (29 sections, 4 theorems, 20 equations, 3 figures, 3 tables)

Introduction
Related Work
Background
Local Outlier Factor.
Local Intrinsic Dimensionality.
LID Representation Theorem.
The Dimensionality-Aware Outlier Model
Asymptotic Local Density Ratio.
Dimensionality-Aware Reformulation of ALDR.
The Dimensionality-Aware Outlierness Criterion.
Evaluation
Methods and Parameters
Outlier detection algorithms.
LID estimators.
Implementation and code.
...and 14 more sections

Key Result

Theorem 3.1

Let $F$ be a real-valued function that is non-zero over some open interval containing $r\in\mathbb{R}$, $r\neq 0$. If $F$ is continuously differentiable at $r$, then
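The statement is cut off in this extract. In the LID literature that the theorem is attributed to (Hou17a), the conclusion is usually given as the identity $\lim_{\epsilon\to 0} \ln\bigl(F((1+\epsilon)r)/F(r)\bigr)/\ln(1+\epsilon) = r\,F'(r)/F(r)$; treat this as an assumption about the missing text rather than a quotation from the paper. Under that assumption, the following sketch checks the identity numerically for two example functions chosen only for illustration. The helper names are hypothetical.

```python
import math

def id_by_limit(F, r, eps=1e-6):
    """Finite-eps approximation of ln(F((1+eps) r) / F(r)) / ln(1+eps)."""
    return math.log(F((1.0 + eps) * r) / F(r)) / math.log1p(eps)

def id_by_derivative(F, dF, r):
    """Closed form r * F'(r) / F(r)."""
    return r * dF(r) / F(r)

# Example 1: F(r) = r^m, whose intrinsic dimensionality is m at every r > 0.
m = 3
power = lambda r: r ** m
d_power = lambda r: m * r ** (m - 1)

# Example 2: F(r) = 1 - exp(-r^2), a smooth function that is positive for r > 0.
gauss = lambda r: 1.0 - math.exp(-r * r)
d_gauss = lambda r: 2.0 * r * math.exp(-r * r)

for F, dF, r in [(power, d_power, 2.0), (gauss, d_gauss, 0.5)]:
    print(id_by_limit(F, r), id_by_derivative(F, dF, r))
```

For the power function both quantities equal the exponent $m$, which matches the intuition that the volume of a ball in an $m$-dimensional space grows as $r^m$.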

Figures (3)

Figure 1: ROC AUC values for outlier detection performance over 480 synthetic datasets containing 2 clusters. One of the clusters ($c_1$) has intrinsic dimension fixed at 8. The intrinsic dimension of the other cluster ($c_2$) varies across the datasets ($x$-axis). The dashed vertical line indicates the reference set with both clusters sharing the same intrinsic dimension (8). The results shown are averages over 30 datasets with the same characteristics. Bars indicate standard deviation.
Figure 2: Differences in ROC AUC performance between $\mathrm{DAO}_{\mathrm{MLE}}$ and the dimensionality-unaware methods over 393 real datasets. Blue dots indicate datasets where $\mathrm{DAO}$ outperforms its competitor, whereas red dots indicate the opposite. The 'Oracle' method is the best-performing competitor for each individual dataset. Color intensity is proportional to the ROC AUC difference. The $x$-axis shows the Moran's I autocorrelation of the log-LID estimates, and the $y$-axis shows their dispersion $R$ (mean absolute difference).
Figure 3: Critical difference diagram (significance level $\alpha$ = 1e-16) of average ranks of the methods on 393 real datasets: $\mathrm{DAO}_{\mathrm{MLE}}$ vs. baseline competitors.
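The figures rely on two evaluation quantities: ROC AUC per dataset and average ranks of methods across datasets. The sketch below shows one common way to compute both. It is not taken from the paper's codebase; the function names are hypothetical and the numbers in the usage lines are made-up toy values, not results from the paper.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that an outlier (label 1) outscores an inlier (label 0).

    Ties between an outlier and an inlier count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ranks_desc(row):
    """Rank values so that the highest gets rank 1; tied values share their average rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def average_ranks(auc_table):
    """auc_table[d][m] is the ROC AUC of method m on dataset d; returns one mean rank per method."""
    per_dataset = [ranks_desc(row) for row in auc_table]
    n = len(per_dataset)
    return [sum(r[m] for r in per_dataset) / n for m in range(len(auc_table[0]))]

# Toy usage with made-up values: 2 datasets, 3 methods.
toy_auc = [[0.9, 0.8, 0.7], [0.6, 0.9, 0.6]]
print(average_ranks(toy_auc))
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

A critical difference diagram then places each method at its average rank and connects methods whose rank difference is not statistically significant; that significance test is outside the scope of this sketch.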

Theorems & Definitions (5)

Definition 3.1: Hou17a
Theorem 3.1: Hou17a
Theorem 3.2: LID Representation Hou17a
Theorem 3.3: Houle20
Theorem 4.1
