Efficient Generation of Hidden Outliers for Improved Outlier Detection

Jose Cribeiro-Ramallo, Vadim Arzamasov, Klemens Böhm

TL;DR

This work addresses outlier detection in high-dimensional data by exploiting the 'multiple views' property, where outliers manifest in subspaces. It introduces BISECT, a hyperparameter-free method that generates hidden outliers by root-finding along convex combinations of points, combined with a "cut trick" and backed by a hidden-outlier existence proposition. BISECT guarantees the generation of hidden outliers and is more efficient than prior methods. The synthetic hidden outliers improve one-class and supervised outlier detection when used for self-supervised learning or oversampling. Experiments on synthetic and real datasets show BISECT outperforming the baselines in both generation time and downstream detection performance, and the code is released for reproducibility. The approach is relevant to anomaly detection in high-dimensional domains where outliers with multiple views must be modeled reliably.
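To make the root-finding idea concrete, here is a minimal sketch of bisection along the convex combination of two points. It is an illustration under stated assumptions, not the authors' BISECT implementation: the function name `boundary_by_bisection`, the `is_inlier` indicator, and the tolerance parameters are hypothetical, and the inlier region is assumed to be crossed exactly once along the segment.

```python
import numpy as np


def boundary_by_bisection(x, y, is_inlier, tol=1e-6, max_iter=100):
    """Locate where z(t) = (1 - t) * x + t * y leaves the inlier region.

    Hypothetical helper: `is_inlier` is any callable returning True for
    points inside the region. Assumes is_inlier(x) is True and
    is_inlier(y) is False.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if not is_inlier(x) or is_inlier(y):
        raise ValueError("x must be inside the region and y outside it")

    lo, hi = 0.0, 1.0  # invariant: z(lo) is inside, z(hi) is outside
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        z = (1.0 - mid) * x + mid * y
        if is_inlier(z):
            lo = mid  # crossing lies in the upper half
        else:
            hi = mid  # crossing lies in the lower half

    t = 0.5 * (lo + hi)
    return t, (1.0 - t) * x + t * y


if __name__ == "__main__":
    # Toy region (an assumption for the demo): the closed unit disc.
    in_unit_disc = lambda p: np.linalg.norm(p) <= 1.0
    t, z = boundary_by_bisection([0.0, 0.0], [2.0, 0.0], in_unit_disc)
    print(t, z)
```

In this toy setup the segment from the origin to (2, 0) crosses the unit circle halfway, so the returned `t` should be close to 0.5. In the paper's setting the indicator would come from fitted outlier detectors on the full space and on subspaces; that part is not reproduced here.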

Abstract

Outlier generation is a popular technique used for solving important outlier detection tasks. Generating outliers with realistic behavior is challenging. Popular existing methods tend to disregard the 'multiple views' property of outliers in high-dimensional spaces. The only existing method accounting for this property falls short in efficiency and effectiveness. We propose BISECT, a new outlier generation method that creates realistic outliers mimicking said property. To do so, BISECT employs a novel proposition introduced in this article stating how to efficiently generate said realistic outliers. Our method has better guarantees and complexity than the current methodology for recreating 'multiple views'. We use the synthetic outliers generated by BISECT to effectively enhance outlier detection in diverse datasets, for multiple use cases. For instance, oversampling with BISECT reduced the error by up to a factor of 3 compared with the baselines.

Paper Structure (31 sections, 7 theorems, 17 equations, 4 figures, 8 tables, 2 algorithms)

Key Result

Proposition 1

("Hidden outlier existence"): Let $x$ and $y$ be points in the previously defined metric space such that $x \in R(\mathcal{M})$ and $y \notin R(\mathcal{M})$. Assume that there exists a point $z$ in the convex combination of $x$ and $y$ such as $z \in \partial R(\mathcal{M}) \Rightarrow z \notin \pa

Figures (4)

  • Figure 1: Example of regions of hidden outliers.
  • Figure 2: Examples of different hidden regions. $R(\mathcal{M})$ is marked in green, $R(\mathcal{E}_{\mathcal{M}})$ in grey, and the convex combination of $x$ and $y$ is represented with a dashed line.
  • Figure 3: An example of the "cut trick".
  • Figure 4: Time to generate 500 hidden outliers (in seconds) in synthetic and real data contingent on feature count.

Theorems & Definitions (15)

  • Example 1
  • Definition 1
  • Proposition 1
  • Example 2
  • Theorem 1
  • Example 3
  • Proposition 2
  • Proposition 3
  • Proof
  • Lemma 1
  • ...and 5 more