Generative Subspace Adversarial Active Learning for Outlier Detection in Multiple Views of High-dimensional Data

Jose Cribeiro-Ramallo, Vadim Arzamasov, Federico Matteucci, Denis Wambold, Klemens Böhm

TL;DR

This work addresses outlier detection in high-dimensional tabular data by formalizing the multiple-views (MV) problem and introducing Generative Subspace Adversarial Active Learning (GSAAL). GSAAL pairs a single generator operating in the full feature space with multiple discriminators, each operating on a different subspace. This lets it detect outliers that are visible only in certain views while also handling the inlier assumption (IA) and the curse of dimensionality (CD). The theoretical contributions are a mathematical formulation of MV, convergence guarantees for the subspace discriminators, and scalability results. Experiments on synthetic and real datasets show stronger performance than other popular outlier detection (OD) methods, especially in MV scenarios, together with scalable inference. The code is publicly available for reproducibility. Overall, GSAAL advances unsupervised OD by addressing MV explicitly alongside IA and CD, which makes it a scalable option for large, high-dimensional tabular datasets.

Abstract

Outlier detection in high-dimensional tabular data is an important task in data mining, essential for many downstream tasks and applications. Existing unsupervised outlier detection algorithms face one or more problems, including inlier assumption (IA), curse of dimensionality (CD), and multiple views (MV). To address these issues, we introduce Generative Subspace Adversarial Active Learning (GSAAL), a novel approach that uses a Generative Adversarial Network with multiple adversaries. These adversaries learn the marginal class probability functions over different data subspaces, while a single generator in the full space models the entire distribution of the inlier class. GSAAL is specifically designed to address the MV limitation while also handling the IA and CD, being the only method to do so. We provide a comprehensive mathematical formulation of MV, convergence guarantees for the discriminators, and scalability results for GSAAL. Our extensive experiments demonstrate the effectiveness and scalability of GSAAL, highlighting its superior performance compared to other popular OD methods, especially in MV scenarios.

Paper Structure

This paper contains 42 sections, 6 theorems, 36 equations, 10 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Let $\mathbf{x}$ and $\mathbf{u}$ be as before with $p_\mathbf{x}$ myopic to the views of $\mathbf{u}$. Consider a set of independent realizations of $\mathbf{u}$: $\{u_i\}_{i=1}^{k}$. Then $\frac{1}{k} \sum_{i=1}^{k} p_{u_i\mathbf{x}}(u_ix)$ is a sufficient statistic for $p_{\mathbf{ux}}(ux)$.
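
The averaging in Proposition 1 can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: each realization $u_i$ is modeled as a selection of feature columns, the subsets are drawn by a hypothetical sampling scheme, and a fixed standard normal density stands in for a trained subspace detector. All function names are illustrative and are not the authors' code.

```python
import numpy as np

def sample_subspaces(n_features, k, rng):
    """Draw k random non-empty feature subsets (hypothetical sampling scheme)."""
    subspaces = []
    for _ in range(k):
        size = rng.integers(1, n_features + 1)
        idx = rng.choice(n_features, size=size, replace=False)
        subspaces.append(np.sort(idx))
    return subspaces

def gaussian_density(Z):
    """Stand-in detector: standard normal density of the projected points."""
    d = Z.shape[1]
    return np.exp(-0.5 * np.sum(Z ** 2, axis=1)) / (2 * np.pi) ** (d / 2)

def averaged_score(X, subspaces, detectors):
    """Compute (1/k) * sum_i detector_i(X restricted to subspace_i)."""
    per_view = [det(X[:, idx]) for idx, det in zip(subspaces, detectors)]
    return np.mean(per_view, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[0] = [0.0, 0.0, 0.0, 0.0, 8.0]  # unusual only in views containing feature 4

subspaces = sample_subspaces(n_features=5, k=10, rng=rng)
detectors = [gaussian_density] * len(subspaces)  # one stand-in per subspace
scores = averaged_score(X, subspaces, detectors)  # low value = less inlier-like
```

In this sketch a point that looks ordinary in most views can still receive a low averaged score when some sampled subspaces contain the feature in which it deviates, which is the situation the MV formulation targets.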

Figures (10)

  • Figure 1: Scatterplots of the dataset from Example 1.
  • Figure 2: Projected classification boundaries for datasets banana, spiral, and star.
  • Figure 3: Boxplots of each method's rank in the real-world datasets.
  • Figure 4: Performance of the detector with different values of $k$.
  • Figure 5: Plots of different performance metrics for scalability.
  • ...and 5 more figures

Theorems & Definitions (12)

  • Example 1: Effect of MV, IA and CD
  • Definition 1: myopic distribution
  • Proposition 1
  • Theorem 1
  • Theorem 2
  • Proposition 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • ...and 2 more