McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets

Braulio V. Sánchez Vinces; Robson L. F. Cordeiro; Christos Faloutsos

TL;DR

McCatch is a new algorithm that detects microclusters by leveraging the proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). It outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional.

Abstract

How could we have an outlier detector that works even with nondimensional data, and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How can we obtain scores that are principled in a scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, etc.; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above, and that it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-element microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.
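To make the "nondimensional data" point concrete, the following is a minimal sketch, not McCatch itself: it computes only the 1NN Distance of each element (one axis of the 'Oracle' plot) using nothing but a pairwise distance function. The names `levenshtein` and `one_nn_distances`, the sample last names, and the brute-force quadratic scan are all assumptions made for illustration; the Group 1NN Distance axis is left out because its definition is not given in this summary.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def one_nn_distances(items: Sequence[T],
                     dist: Callable[[T, T], float]) -> List[float]:
    """Distance from each item to its nearest *other* item.

    Brute force, O(n^2) distance calls: an illustrative assumption,
    not the scalable procedure described in the paper.
    """
    return [min(dist(x, y) for j, y in enumerate(items) if j != i)
            for i, x in enumerate(items)]


# Hypothetical sample: strings have no coordinates, only distances.
names = ["smith", "smyth", "smithe", "jones", "jonas", "nakamura"]
nn = one_nn_distances(names, levenshtein)
most_isolated = names[max(range(len(names)), key=nn.__getitem__)]
print(list(zip(names, nn)), most_isolated)
```

Because only `dist` touches the data, the same function would accept any metric (for example, a graph or fingerprint distance) without requiring a vector representation.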

Paper Structure (22 sections, 2 theorems, 6 equations, 9 figures, 6 tables, 4 algorithms)

Introduction
Problem & Related Work
Problem Statement
Related Work
Proposed Axioms
Proposed Method
Intuition & the 'Oracle' Plot
McCatch in a Nutshell
Build the 'Oracle' Plot
Spot the Microclusters
Compute the Cutoff
Gel the Outliers into Microclusters
Compute the Anomaly Scores
Time and Space Complexity
Implementation
...and 7 more sections

Key Result

Lemma 1

The time complexity of McCatch is $O\left(n \cdot n^{1-\frac{1}{u}}\right)$, where $u$ is the intrinsic (correlation fractal) dimension. We only need distances to compute the fractal dimension $u$, which is how quickly the number of neighbors grows with the distance [DBLP:conf/pods/FaloutsosK94]. It can
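As a quick worked reading of the bound in Lemma 1 (the sample values of $u$ and $n$ below are illustrative assumptions, not figures from the paper), the exponent moves from linear toward quadratic as the intrinsic dimension grows:

$$
n \cdot n^{1-\frac{1}{u}} = n^{2-\frac{1}{u}}, \qquad
u = 1 \Rightarrow n^{1}, \quad
u = 2 \Rightarrow n^{1.5}, \quad
u \to \infty \Rightarrow n^{2}.
$$

For instance, with $n = 10^{6}$ and $u = 2$, the bound evaluates to $10^{6} \cdot 10^{3} = 10^{9}$ up to constant factors, compared with $10^{12}$ for a fully quadratic method.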

Figures (9)

Figure 1: McCatch is unsupervised, and it ALSO works on nondimensional data: (i) on vector, $3$d data from a satellite image of Shanghai -- it spots two $2$-element microclusters of unusually colored roofs, and a few other outliers; on nondimensional data of last names (ii) and skeletons (iii) -- it gives high anomaly scores to the few non-English names and skeletons of wild animals. (best viewed in color)
Figure 2: Proposed Axioms: the green microcluster is always weirder, i.e., it has the larger anomaly score. All else being equal, (i) Isolation Axiom -- the furthest-away microcluster wins; (ii) Cardinality Axiom -- the less populous microcluster wins. (best viewed in color)
Figure 3: Intuition & the 'Oracle' plot: McCatch spots outliers in a dataset (i) using our 'Oracle' plot (ii). The plot groups inliers like point 'A' (in black) at its bottom-left, and distinguishes outliers by type; see 'B' (orange), 'C' (green), 'D' (violet), and 'E' (red). Outliers 'C' and 'D' from the microcluster in green/violet are isolated at the top. This is made possible by capitalizing on plateaus formed in the count of neighbors of each point as the neighborhood radius varies; see examples in (iii). (best viewed in color)
Figure 4: McCatch obtains the Cutoff $\mathbf{d}$ automatically, by partitioning a histogram of $1$NN Distances so as to best separate tall and short bins. This is done by minimizing the cost of compressing the partitions. (best viewed in color)
Figure 5: McCatch's scores quantify how much each microcluster (mc) $\mathbf{M}_{j}$ is compressed when it is described in terms of the nearest inlier $\mathbf{p}_{i}$. (best viewed in color)
...and 4 more figures
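The cutoff idea in the Figure 4 caption (split a histogram of 1NN Distances into a "tall bins" part and a "short bins" part by minimizing a compression cost) can be sketched as follows. This is only an illustration under stated assumptions: the paper's actual Cost of Compression (Definition 5) is not reproduced in this summary, so `partition_cost` below swaps in a rough stand-in that charges about $\log_2(1 + |h - \text{mean}|)$ for each bin height $h$. The function names and the sample histogram are hypothetical.

```python
import math
from typing import Sequence, Tuple


def partition_cost(heights: Sequence[int]) -> float:
    """Stand-in cost (an assumption, not the paper's Definition 5):
    a rough proxy for the bits needed to encode each bin height as a
    deviation from the mean height of its partition."""
    if not heights:
        return 0.0
    mean = sum(heights) / len(heights)
    return sum(math.log2(1 + abs(h - mean)) for h in heights)


def best_split(heights: Sequence[int]) -> Tuple[int, float]:
    """Return (k, cost) where bins [0, k) and [k, n) are the two
    contiguous partitions with the lowest total stand-in cost.
    Expects at least two bins."""
    best_k, best_cost = 1, float("inf")
    for k in range(1, len(heights)):
        cost = partition_cost(heights[:k]) + partition_cost(heights[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Hypothetical histogram: three tall bins followed by four short ones.
heights = [50, 40, 45, 2, 1, 0, 1]
k, cost = best_split(heights)
print(k, round(cost, 2))
```

In this sketch the returned index `k` marks the first "short" bin; mapping it to a distance value would use the lower edge of that bin, which depends on how the histogram bins are defined.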

Theorems & Definitions (11)

Definition 1: Plateau
Definition 2: First Plateau
Definition 3: Middle Plateau
Definition 4: Histogram of $\mathbf{1}$NN Distances
Definition 5: Cost of Compression
Definition 6: Cutoff
Definition 7: Score
Lemma 1: Time Complexity
Proof 1
Lemma 2: Space Complexity
...and 1 more
