Adversarial Subspace Generation for Outlier Detection in High-Dimensional Data

Jose Cribeiro-Ramallo, Federico Matteucci, Paul Enciu, Alexander Jenke, Vadim Arzamasov, Thorsten Strufe, Klemens Böhm

TL;DR

The paper tackles outlier detection in high-dimensional data by formalizing the Multiple Views (MV) phenomenon through Myopic Subspace Theory (MST). MST casts subspace selection as a stochastic optimization over lens operators that preserve the data distribution, and proves convergence guarantees under mild conditions. To solve the optimization, it introduces V-GAN, a generative framework that learns to sample subspace projections via Maximum Mean Discrepancy (MMD) objectives, with an axis-parallel instantiation and optional kernel learning. Empirically, V-GAN substantially improves one-class classification performance across 42 real-world datasets and scales favorably against baselines, while synthetic experiments validate accurate subspace recovery. The work suggests broad potential for MST beyond tabular data, including applications to contrastive learning and other modalities.
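To make the MMD objective concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth, the biased estimator, and the hand-picked masks `keep_all` and `drop_last` are all choices made for this example. In V-GAN the axis-parallel masks would come from a trained generator; here they are fixed by hand to show how the discrepancy reacts when a projection changes the data distribution.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(400, 3))        # one batch of data
x_prime = rng.normal(size=(400, 3))  # an independent batch to be projected

# Axis-parallel projections written as 0/1 masks (hypothetical, hand-picked).
keep_all = np.ones(3)
drop_last = np.array([1.0, 1.0, 0.0])

mmd_identity = mmd2_biased(x, x_prime * keep_all)
mmd_dropped = mmd2_biased(x, x_prime * drop_last)
print(mmd_identity, mmd_dropped)
```

In this toy setting every feature carries information, so zeroing the last one visibly changes the distribution and the estimate grows. A projection that left the distribution unchanged would keep the estimate near the value obtained with `keep_all`.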

Abstract

Outlier detection in high-dimensional tabular data is challenging because data is often distributed across multiple lower-dimensional subspaces, a phenomenon known as the Multiple Views (MV) effect. This effect led to a large body of research on mining such subspaces, known as subspace selection. However, because the precise nature of the MV effect was not well understood, traditional methods had to rely on heuristic-driven search schemes that struggle to capture the true structure of the data. Properly identifying these subspaces is critical for unsupervised tasks such as outlier detection or clustering, where misrepresenting the underlying data structure can hinder performance. We introduce Myopic Subspace Theory (MST), a new theoretical framework that mathematically formulates the MV effect and writes subspace selection as a stochastic optimization problem. Based on MST, we introduce V-GAN, a generative method trained to solve this optimization problem. This approach avoids any exhaustive search over the feature space while ensuring that the intrinsic data structure is preserved. Experiments on 42 real-world datasets show that using V-GAN subspaces to build ensemble methods leads to a significant increase in one-class classification performance compared to existing subspace selection, feature selection, and embedding methods. Further experiments on synthetic data show that V-GAN identifies subspaces more accurately while scaling better than other relevant subspace selection methods. These results confirm the theoretical guarantees of our approach and highlight its practical viability in high-dimensional settings.
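The idea of building an ensemble over selected subspaces can be sketched as follows. This is a hedged illustration only: the k-nearest-neighbor distance score, the function names, and the hand-picked `subspaces` list are assumptions made for this example, and the subspaces are not produced by V-GAN. The detector is fit on inliers only, as in one-class classification, and the per-subspace scores are averaged.

```python
import numpy as np

def knn_scores(train, test, k=5):
    """Score each test point by the distance to its k-th nearest training point."""
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return np.sort(dists, axis=1)[:, k - 1]

def subspace_ensemble_scores(train, test, subspaces, k=5):
    """Average the kNN scores computed separately in each feature subspace."""
    per_subspace = [knn_scores(train[:, s], test[:, s], k) for s in subspaces]
    return np.mean(per_subspace, axis=0)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 6))
# Features 0 and 1 are strongly correlated among inliers.
train[:, 1] = train[:, 0] + 0.05 * rng.normal(size=500)

test = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],   # follows the correlation (inlier-like)
    [1.5, -1.5, 0.0, 0.0, 0.0, 0.0],  # violates it (outlier-like)
])

subspaces = [[0, 1], [2, 3], [4, 5]]  # hypothetical, chosen by hand
scores = subspace_ensemble_scores(train, test, subspaces)
print(scores)
```

The second test point only looks unusual in the subspace spanned by features 0 and 1, which is why scoring per subspace and then aggregating can surface it.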

Paper Structure

This paper contains 55 sections, 4 theorems, 33 equations, 12 figures, 6 tables, 1 algorithm.

Key Result

Lemma 1

Let $\mathcal{H}$ be an RKHS with a characteristic kernel $\kappa$, and let $\mathbf{x}$, $\mathbf{U}$, and MMD be as previously defined. Further, let $\mathbf{V}$ be a lens operator for $\mathbf{x}$. Then,

Figures (12)

  • Figure 1: (a) Population from the introductory example and the performance of the state of the art (SotA) for subspace search on it. Points in subspace $U_1$ are colored blue and points in subspace $U_2$ purple. (b) The normalized weights $\hat{F}$ and $1-\hat{F}$ assigned by GMD [trittenbach_dimension-based_2019] to subspaces $S_1$ and $S_2$ should be as close as possible to $F$ and $1-F$, i.e., the dashed grey lines.
  • Figure 2: Diagram of the network and its training without kernel learning (left) and with kernel learning (right). (a) The network $G_\theta$ is trained to minimize the loss $\mathcal{L}_\kappa(\theta)$, the empirical estimator of the MMD [gretton_kernel_2012] between samples of $\mathbf{x}$ and $G_\theta(\mathbf{z})\mathbf{x}$ using kernel $\kappa$. (b) The network $G_\theta$ is trained to minimize the same loss as before, but with $\kappa$ composed with $\mathcal{E}_\phi$, the encoder part of an autoencoder. At the same time, $\mathcal{E}_\phi$ is trained to maximize $\mathcal{L}_{\kappa\circ\mathcal{E}_\phi}$ while minimizing the reconstruction loss.
  • Figure 3: Comparison of the relative scores for each subspace across different values of $F$.
  • Figure 4: Boxplots of ranks of the comparison with baselines using myopic datasets with different numbers of features. The bins contained 3, 7, 5, and 6 datasets, respectively.
  • Figure 5: Boxplots of ranks of the comparison with our competitors using myopic datasets.
  • ...and 7 more figures
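
The two training regimes described in the Figure 2 caption can be written out as objectives. The block below is a sketch under assumptions rather than the paper's exact formulation: the trade-off weight $\lambda$, the decoder $\mathcal{D}_\psi$, the squared-error reconstruction term, and the use of a second batch $x'_j$ are names and choices introduced here for illustration.

```latex
% (a) Without kernel learning: the generator minimizes an empirical MMD.
\min_\theta \; \mathcal{L}_\kappa(\theta),
\qquad
\mathcal{L}_\kappa(\theta)
  = \widehat{\mathrm{MMD}}_\kappa\bigl(\{x_i\}_{i=1}^{n},\; \{G_\theta(z_j)\,x'_j\}_{j=1}^{n}\bigr).

% (b) With kernel learning: the kernel is composed with the encoder E_phi.
% The generator minimizes the discrepancy; the encoder maximizes it while
% keeping the reconstruction error small (lambda and D_psi are assumed names).
\min_\theta \; \mathcal{L}_{\kappa\circ\mathcal{E}_\phi}(\theta),
\qquad
\max_{\phi,\psi} \; \mathcal{L}_{\kappa\circ\mathcal{E}_\phi}(\theta)
  - \lambda\,\frac{1}{n}\sum_{i=1}^{n}
    \bigl\lVert x_i - \mathcal{D}_\psi\bigl(\mathcal{E}_\phi(x_i)\bigr)\bigr\rVert^{2}.
```

Read this way, the encoder plays the adversary: it looks for a representation in which the projected and original samples are easiest to tell apart, while the reconstruction term discourages it from discarding information about the data.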

Theorems & Definitions (13)

  • Example 1
  • Definition 1 [cribeiroramallo2024]
  • Definition 2: Myopicity of a distribution
  • Definition 3: Definition 2 in [gretton_kernel_2012]
  • Lemma 1
  • Theorem 2
  • Corollary 3: Convergence to a lens operator
  • proof
  • proof
  • proof
  • ...and 3 more