McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets

Braulio V. Sánchez Vinces, Robson L. F. Cordeiro, Christos Faloutsos

TL;DR

McCatch is a new algorithm that detects microclusters by leveraging the proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). It outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional.

Abstract

How could we have an outlier detector that works even with nondimensional data, and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How can we obtain scores that are principled, in a scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, etc.; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above, and that it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-element microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.

Paper Structure (22 sections, 2 theorems, 6 equations, 9 figures, 6 tables, 4 algorithms)

Key Result

Lemma 1

The time complexity of McCatch is $O\left(n \cdot n^{1-\frac{1}{u}}\right)$, where $u$ is the intrinsic (correlation fractal) dimension. (Footnote: We only need distances to compute the fractal dimension $u$, which is how quickly the number of neighbors grows with the distance [DBLP:conf/pods/FaloutsosK94]. It can ...)
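To get a feel for the bound in Lemma 1, the following minimal sketch evaluates the growth term $n \cdot n^{1-\frac{1}{u}}$ for a few intrinsic dimensions. The function names `cost_bound` and `growth_exponent`, and the sample values of $n$ and $u$, are illustrative assumptions and do not come from the paper; constant factors hidden by the big-O are ignored.

```python
def cost_bound(n: int, u: float) -> float:
    """Evaluate n * n**(1 - 1/u), the growth term of Lemma 1.

    Constant factors hidden by the big-O are ignored, so only the
    ratios between returned values are meaningful.
    """
    if n < 1 or u <= 0:
        raise ValueError("need n >= 1 and u > 0")
    return n * n ** (1.0 - 1.0 / u)


def growth_exponent(u: float) -> float:
    """Exponent e such that the bound equals n**e, i.e. e = 2 - 1/u."""
    if u <= 0:
        raise ValueError("need u > 0")
    return 2.0 - 1.0 / u


if __name__ == "__main__":
    n = 1_000_000  # hypothetical dataset size
    for u in (1.0, 2.0, 5.0, 20.0):  # hypothetical intrinsic dimensions
        print(f"u={u:>4}: exponent={growth_exponent(u):.2f}, "
              f"bound={cost_bound(n, u):.3e}")
```

For $u = 1$ the term is linear in $n$, for $u = 2$ it is $n^{1.5}$, and as $u$ grows it approaches $n^2$.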

Figures (9)

  • Figure 1: McCatch is unsupervised, and it ALSO works on nondimensional data: (i) on $3$-d vector data from a satellite image of Shanghai -- it spots two $2$-element microclusters of unusually colored roofs, and a few other outliers; on nondimensional data of last names (ii) and skeletons (iii) -- it gives high anomaly scores to the few non-English names and to the skeletons of wild animals. (best viewed in color)
  • Figure 2: Proposed Axioms: the green microcluster is always weirder, i.e., it has the larger anomaly score. All else being equal, (i) Isolation Axiom -- the furthest-away microcluster wins; (ii) Cardinality Axiom -- the less populous microcluster wins. (best viewed in color)
  • Figure 3: Intuition & the 'Oracle' plot: McCatch spots outliers in a dataset (i) using our 'Oracle' plot (ii). The plot groups inliers like point 'A' (in black) at its bottom-left, and distinguishes outliers by type; see 'B' (orange), 'C' (green), 'D' (violet), and 'E' (red). Outliers 'C' and 'D' from the microcluster in green/violet are isolated at the top. This is made possible by capitalizing on plateaus formed in the count of neighbors of each point as the neighborhood radius varies; see examples in (iii). (best viewed in color)
  • Figure 4: McCatch obtains the Cutoff $\mathbf{d}$ automatically, by partitioning a histogram of $1$NN Distances so as to best separate tall and short bins. This is done by minimizing the cost of compressing the partitions. (best viewed in color)
  • Figure 5: McCatch's scores quantify how much each microcluster (mc) $\mathbf{M}_{j}$ is compressed when it is described in terms of the nearest inlier $\mathbf{p}_{i}$. (best viewed in color)
  • ...and 4 more figures
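
The captions above rely on two quantities that need only a distance function, not coordinates: each element's 1NN Distance and its neighbor count as the radius grows. The following is a minimal brute-force sketch of both on nondimensional data (strings under edit distance). It is not the paper's scalable procedure, and it does not attempt the 'Group 1NN Distance' axis, which this summary does not define. The helper names `edit_distance`, `one_nn_distances`, and `neighbor_counts`, and the toy name list, are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_nn_distances(items: Sequence[T],
                     dist: Callable[[T, T], float]) -> List[float]:
    """Distance from each item to its nearest other item (brute force, O(n^2))."""
    if len(items) < 2:
        raise ValueError("need at least two items")
    return [min(dist(x, y) for j, y in enumerate(items) if j != i)
            for i, x in enumerate(items)]


def neighbor_counts(items: Sequence[T],
                    dist: Callable[[T, T], float],
                    radii: Sequence[float]) -> List[List[int]]:
    """counts[i][k] = number of other items within radii[k] of items[i]."""
    return [[sum(1 for j, y in enumerate(items)
                 if j != i and dist(x, y) <= r)
             for r in radii]
            for i, x in enumerate(items)]


if __name__ == "__main__":
    names = ["smith", "smyth", "smithe", "nguyen"]  # hypothetical toy data
    print(one_nn_distances(names, edit_distance))
    print(neighbor_counts(names, edit_distance, radii=[1, 2, 5, 6]))
```

In this toy list the last name has a much larger 1NN distance than the rest, and its neighbor count stays at zero over the small radii before it jumps, which is the kind of flat stretch the Figure 3 caption calls a plateau.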

Theorems & Definitions (11)

  • Definition 1: Plateau
  • Definition 2: First Plateau
  • Definition 3: Middle Plateau
  • Definition 4: Histogram of $\mathbf{1}$NN Distances
  • Definition 5: Cost of Compression
  • Definition 6: Cutoff
  • Definition 7: Score
  • Lemma 1: Time Complexity
  • Proof 1
  • Lemma 2: Space Complexity
  • ...and 1 more