Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis

Alastair Anderberg, James Bailey, Ricardo J. G. B. Campello, Michael E. Houle, Henrique O. Marques, Miloš Radovanović, Arthur Zimek

TL;DR

The paper tackles outlier detection under varying local intrinsic dimensionality by introducing a nonparametric, dimensionality-aware scorer called DAO. DAO is grounded in Local Intrinsic Dimensionality (LID) theory and the asymptotic local density ratio (ALDR), enabling it to adapt to local dataset geometry when assessing outlierness. Empirical results across more than 800 synthetic and real datasets show that DAO outperforms traditional baselines such as LOF, SLOF, and kNN, particularly when LID varies significantly within the data; the work also evaluates how LID estimator choice impacts performance. The findings suggest that incorporating local dimensionality leads to more robust and effective outlier detection in complex, high-dimensional settings, with a public codebase to support replication and further research.

Abstract

We present a nonparametric method for outlier detection that takes full account of local variations in intrinsic dimensionality within the dataset. Using the theory of Local Intrinsic Dimensionality (LID), our 'dimensionality-aware' outlier detection method, DAO, is derived as an estimator of an asymptotic local expected density ratio involving the query point and a close neighbor drawn at random. The dimensionality-aware behavior of DAO is due to its use of local estimation of LID values in a theoretically-justified way. Through comprehensive experimentation on more than 800 synthetic and real datasets, we show that DAO significantly outperforms three popular and important benchmark outlier detection methods: Local Outlier Factor (LOF), Simplified LOF, and kNN.
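The abstract attributes DAO's behavior to local estimation of LID values, and the figure captions refer to an MLE-based variant. As a minimal sketch of what such a local estimate can look like, the block below implements the standard maximum-likelihood (Hill-type) LID estimator from k-nearest-neighbor distances. It is an illustration only, not the authors' code: the function name `lid_mle`, the default `k=20`, the brute-force neighbor search, and the Euclidean distance are all assumptions made here, and the DAO score itself is not reproduced.

```python
import math
import random
import statistics


def lid_mle(points, k=20):
    """Return one LID estimate per point from its k nearest-neighbor distances.

    Illustrative assumptions (not from the paper): Euclidean distance,
    brute-force search, more than k points, no duplicate points, and
    neighbors that are not all at the same distance.
    """
    estimates = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        knn = dists[:k]
        r_k = knn[-1]  # distance to the k-th nearest neighbor
        # Mean log distance ratio: every term is <= 0, the last term is 0.
        mean_log = sum(math.log(r / r_k) for r in knn) / k
        estimates.append(-1.0 / mean_log)
    return estimates


if __name__ == "__main__":
    # Hypothetical data: a 3-dimensional uniform cloud embedded in 6 dimensions.
    rng = random.Random(0)
    cloud = [[rng.random(), rng.random(), rng.random(), 0.0, 0.0, 0.0]
             for _ in range(300)]
    print("median LID estimate:", statistics.median(lid_mle(cloud)))
```

On data like the hypothetical cloud above, the estimates should track the dimension of the subspace the points occupy rather than the number of coordinates, which is the property a dimensionality-aware score relies on.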

Paper Structure (29 sections, 4 theorems, 20 equations, 3 figures, 3 tables)

This paper contains 29 sections, 4 theorems, 20 equations, 3 figures, 3 tables.

Key Result

Theorem 3.1

Let $F$ be a real-valued function that is non-zero over some open interval containing $r\in\mathbb{R}$, $r\neq 0$. If $F$ is continuously differentiable at $r$, then

Figures (3)

  • Figure 1: ROC AUC values for outlier detection performance over 480 synthetic datasets containing 2 clusters. One of the clusters ($c_1$) has intrinsic dimension fixed at 8. The intrinsic dimension of the other cluster ($c_2$) varies across the datasets ($x$-axis). The dashed vertical line indicates the reference set with both clusters sharing the same intrinsic dimension (8). The results shown are averages over 30 datasets with the same characteristics. Bars indicate standard deviation.
  • Figure 2: Differences in ROC AUC performance between $\mathrm{DAO}_{\mathrm{MLE}}$ and the dimensionality-unaware methods over 393 real datasets. Blue dots indicate datasets where $\mathrm{DAO}$ outperforms its competitor, whereas red dots indicate the opposite. The 'Oracle' method indicates the best-performing competitor for each individual dataset. Color intensity is proportional to the ROC AUC difference. The $x$-axis shows the Moran's I autocorrelation of the log-LID estimates, and the $y$-axis shows their dispersion $R$ (mean absolute difference).
  • Figure 3: Critical difference diagram (significance level $\alpha = 10^{-16}$) of average ranks of the methods on 393 real datasets: $\mathrm{DAO}_{\mathrm{MLE}}$ vs. baseline competitors.
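The Figure 2 caption describes the dispersion $R$ as a mean absolute difference of log-LID estimates. The sketch below shows one common reading of that statistic: the average of $|x_i - x_j|$ over all unordered pairs. The caption does not say which pairs or which normalization the paper uses, so the all-pairs choice and the function names `mean_absolute_difference` and `log_lid_dispersion` are assumptions for illustration.

```python
import math
from itertools import combinations


def mean_absolute_difference(values):
    """Average of |a - b| over all unordered pairs (needs at least 2 values)."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def log_lid_dispersion(lid_estimates):
    """Mean absolute difference of the natural logs of positive LID estimates.

    Assumption: all unordered pairs are used; the paper may define R differently.
    """
    return mean_absolute_difference([math.log(x) for x in lid_estimates])
```

A dataset whose LID estimates are all equal gives a dispersion of 0 under this definition, and the value grows as the estimates spread over more orders of magnitude.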

Theorems & Definitions (5)

  • Definition 3.1: Hou17a
  • Theorem 3.1: Hou17a
  • Theorem 3.2: LID Representation Hou17a
  • Theorem 3.3: Houle20
  • Theorem 4.1