Outlier Detection with Cluster Catch Digraphs

Rui Shi, Nedret Billor, Elvan Ceyhan

TL;DR

The paper tackles outlier detection in high-dimensional data with clusters of arbitrary shape by developing four algorithms based on Cluster Catch Digraphs (CCDs): RU-MCCD, UN-MCCD, SU-MCCD, and SUN-MCCD. RU-MCCD and UN-MCCD rely on spherical covering balls (KS-CCD/RK-CCD or NN-based tests) to identify dominating regions and detect outliers via mutual catch graphs, while SU-MCCD and SUN-MCCD handle flexible cluster shapes by aggregating multiple covering balls and using shape-adaptive mechanisms. The authors also introduce the OOS and IOS scores to quantify outlyingness and provide space and time complexity analyses. Extensive Monte Carlo simulations, including high-dimensional Gaussian and uniform clusters, and real-data benchmarks show that SUN-MCCD generally offers the best robustness and accuracy, particularly in high dimensions. The result is a versatile, largely parameter-free toolkit for outlier detection that scales to complex data and supports clustering insights, with potential applications in finance, bioinformatics, cybersecurity, and other domains where detecting anomalies in complex data is critical.

Abstract

This paper introduces a novel family of outlier detection algorithms based on Cluster Catch Digraphs (CCDs), specifically tailored to address the challenges of high dimensionality and varying cluster shapes, which deteriorate the performance of most traditional outlier detection methods. We propose the Uniformity-Based CCD with Mutual Catch Graph (U-MCCD), the Uniformity- and Neighbor-Based CCD with Mutual Catch Graph (UN-MCCD), and their shape-adaptive variants (SU-MCCD and SUN-MCCD), which are designed to detect outliers in data sets with arbitrary cluster shapes and high dimensions. We present the advantages and shortcomings of these algorithms and explain the motivation for defining each of them. Through comprehensive Monte Carlo simulations, we assess their performance and demonstrate the robustness and effectiveness of our algorithms across various settings and contamination levels. We also illustrate the use of our algorithms on various real-life data sets. The U-MCCD algorithm efficiently identifies outliers while maintaining high true negative rates, and the SU-MCCD algorithm shows substantial improvement in handling non-uniform clusters. Additionally, the UN-MCCD and SUN-MCCD algorithms address the limitations of existing methods in high-dimensional spaces by utilizing Nearest Neighbor Distances (NND) for clustering and outlier detection. Our results indicate that these novel algorithms offer substantial advancements in the accuracy and adaptability of outlier detection, providing a valuable tool for various real-world applications.

Keywords: Outlier detection, Graph-based clustering, Cluster catch digraphs, $k$-nearest-neighborhood, Mutual catch graphs, Nearest neighbor distance.

Paper Structure

This paper contains 36 sections, 6 theorems, 7 equations, 28 figures, 35 tables, 10 algorithms.

Key Result

Theorem 3.1

Let $\mathcal{X} \subset \mathbb{R}^d$ be a data set of size $n$ ($d<n$), and suppose we simulate $M$ data sets from the (estimated) $F$ and $S$. Then the time complexity of Algorithm \ref{alg:DMCG_Algo} is $O(M n^2(d+\log n)+M\log M)$.
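
To get a feel for which term of this bound matters in practice, the following short calculation compares the two terms. The numbers $n=100$ and $d=2$ and the use of the natural logarithm are illustrative assumptions, not values taken from the theorem.

```latex
% The second term M log M can only dominate when
%   log M > n^2 (d + log n).
% Illustrative values (assumed): n = 100, d = 2, natural logarithm.
\begin{align*}
  n^2\,(d + \log n) &= 100^2\,(2 + \log 100) \\
                    &\approx 10^4 \times 6.605 \\
                    &\approx 6.6 \times 10^4 .
\end{align*}
% So M log M would exceed M n^2 (d + log n) only if log M > 6.6e4,
% i.e. M > e^{66000}, far beyond any feasible number of simulations.
% For realistic M the bound behaves like
\[
  O\bigl(M\,n^2\,(d + \log n)\bigr).
\]
```

Under these assumptions the cost grows linearly in the number of simulated data sets $M$ and roughly quadratically in the sample size $n$.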

Figures (28)

  • Figure 1: (a) A data set with 45 regular points (black) generated uniformly within the unit circle $B((0,0),1)$ and 5 outliers (red crosses) drawn uniformly from another unit circle $B((3,0),1)$, whose center is 3 units away from that of the first. (b) A data set with 45 regular points (black) distributed uniformly within the unit circle $B((0,0),1)$ and 5 outliers (red) drawn uniformly from the annular region between $B((0,0),1.5)$ and $B((0,0),3)$. (c) & (d) The connected components returned by the D-MCG algorithm; the circles are the estimated supports of the regular data points, obtained by SVDD with a polynomial kernel of degree 1.
  • Figure 2: Some simulated uniform data sets; black points are regular data points and red crosses are outliers. (a) 2 clusters, 5% outliers, $n=100$. (b) 2 clusters with different sizes, 5% outliers, $n=100$. (c) 3 clusters, 10% outliers, $n=100$. (d) 3 clusters with different sizes, 10% outliers, $n=100$. (e) 4 clusters, 10% outliers, $n=200$. (f) 4 clusters with different sizes, 5% outliers, $n=200$.
  • Figure 3: The connected components and outliers determined by the RU-MCCD algorithm (Algorithm \ref{alg:RUMCCD}) for the settings in Figure \ref{2d_fig_RUMCCD_Algo2}. The solid black circles are the dominating covering balls of RK-CCDs.
  • Figure 4: An illustration of the SU-MCCD algorithm.
  • Figure 5: Two realizations of the simulation settings described in Section \ref{sec:Uni_General_Settings_Des}, with $n=100$ and $n=200$, respectively. Each data set has 2 clusters of the same size but different intensities. Black points are regular data points, and red crosses are outliers.
  • ...and 23 more figures
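
The two contamination settings described in the caption of Figure 1, panels (a) and (b), can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names (`sample_disk`, `sample_annulus`, `setting_a`, `setting_b`) and the seeding scheme are hypothetical and are not taken from the paper.

```python
import math
import random


def sample_disk(center, radius, rng):
    """Draw one point uniformly from the disk B(center, radius)."""
    # The square root of a uniform variate makes the draw uniform in area
    # instead of concentrating points near the center.
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))


def sample_annulus(center, r_inner, r_outer, rng):
    """Draw one point uniformly from the annulus between two radii."""
    # Sample the squared radius uniformly between r_inner^2 and r_outer^2.
    r = math.sqrt(r_inner**2 + rng.random() * (r_outer**2 - r_inner**2))
    theta = 2.0 * math.pi * rng.random()
    return (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))


def setting_a(seed=0):
    """Panel (a): 45 regular points in B((0,0),1), 5 outliers in B((3,0),1)."""
    rng = random.Random(seed)
    regular = [sample_disk((0.0, 0.0), 1.0, rng) for _ in range(45)]
    outliers = [sample_disk((3.0, 0.0), 1.0, rng) for _ in range(5)]
    return regular, outliers


def setting_b(seed=0):
    """Panel (b): 45 regular points in B((0,0),1), 5 outliers in the
    annulus between B((0,0),1.5) and B((0,0),3)."""
    rng = random.Random(seed)
    regular = [sample_disk((0.0, 0.0), 1.0, rng) for _ in range(45)]
    outliers = [sample_annulus((0.0, 0.0), 1.5, 3.0, rng) for _ in range(5)]
    return regular, outliers
```

Calling `setting_a(seed=1)` or `setting_b(seed=1)` returns a pair of lists of 2D points. Setting (a) places the outliers in a separate, compact group, while setting (b) scatters them around the regular cluster.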

Theorems & Definitions (7)

  • Definition 3.1: Mutual Catch Graphs (MCGs)
  • Theorem 3.1: Time Complexity of Algorithm \ref{alg:DMCG_Algo}
  • Theorem 3.2: Time Complexity of Algorithm \ref{alg:RUMCCD}
  • Theorem 3.3: Time Complexity of Algorithm \ref{alg:SUMCCD_Algo}
  • Theorem 3.4: Time Complexity of Algorithm \ref{alg:UN-CCDs}
  • Theorem 3.5: Time Complexity of Algorithm \ref{alg:UNMCCD_Algo}
  • Theorem 3.6: Time Complexity of Algorithm \ref{alg:SUN-MCCD}
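
The statement of Definition 3.1 is not reproduced in this summary. The sketch below assumes the usual reading of a mutual catch graph: two points are joined by an edge when each lies inside the other's covering ball. The covering radii are taken as given inputs here (how the CCD algorithms choose them is not shown), open balls are an assumption, and the function names are hypothetical.

```python
import math


def mutual_catch_edges(points, radii):
    """Return edges (i, j), i < j, where each point catches the other.

    Assumption: point i catches point j when j lies strictly inside the
    ball of radius radii[i] centered at points[i].
    """
    edges = []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(points[i], points[j])
            if dist < radii[i] and dist < radii[j]:
                edges.append((i, j))
    return edges


def connected_components(n, edges):
    """Group vertices 0..n-1 into connected components with union-find."""
    parent = list(range(n))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Toy example: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.4, 0.3), (3.0, 0.0)]
radii = [1.0, 1.0, 1.0, 0.5]
edges = mutual_catch_edges(points, radii)
components = connected_components(len(points), edges)
```

In this toy example the first three points form one component and the fourth point is left on its own. Small or isolated components are natural outlier candidates, but the paper's actual decision rule is not reproduced here.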