Contraction Clustering (RASTER): A Very Fast Big Data Algorithm for Sequential and Parallel Density-Based Clustering in Linear Time, Constant Memory, and a Single Pass

Gregor Ulm, Simon Smith, Adrian Nilsson, Emil Gustavsson, Mats Jirstrand

TL;DR

RASTER introduces a grid-based, density-based clustering framework designed for big data, achieving $O(n)$ time and constant memory for a fixed grid by projecting points onto tiles and clustering significant tiles. It addresses hub identification in GPS-like data and provides a parallel variant (P-RASTER) and an input-retaining variant (RASTER$'$). The work offers formal complexity analysis, extensive experiments against standard clustering methods, and demonstrates strong single-threaded and multicore performance, with a speedup of roughly $\frac{c}{2}$ on $c$ cores. Overall, RASTER provides a practical, scalable solution for fast density-based clustering suitable for terabyte-scale data and hub-detection tasks.

Abstract

Clustering is an essential data mining tool for analyzing and grouping similar objects. In big data applications, however, many clustering algorithms are infeasible due to their high memory requirements and/or unfavorable runtime complexity. In contrast, Contraction Clustering (RASTER) is a single-pass algorithm for identifying density-based clusters with linear time complexity. Due to its favorable runtime and the fact that its memory requirements are constant, this algorithm is highly suitable for big data applications where the amount of data to be processed is huge. It consists of two steps: (1) a contraction step which projects objects onto tiles and (2) an agglomeration step which groups tiles into clusters. This algorithm is extremely fast in both sequential and parallel execution. Our quantitative evaluation shows that a sequential implementation of RASTER performs significantly better than various standard clustering algorithms. Furthermore, the parallel speedup is significant: on a contemporary workstation, an implementation in Rust processes a batch of 500 million points with 1 million clusters in less than 50 seconds on one core. With 8 cores, the algorithm is about four times faster.
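
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Rust implementation: the function name `raster`, the parameters `precision`, `tau`, and `min_size`, the 8-neighborhood used for agglomeration, and the strict `> tau` test (taken from the wording of the Figure 4 caption) are all assumptions made for illustration.

```python
from collections import Counter, deque
from math import floor


def raster(points, precision=0, tau=2, min_size=1):
    """Illustrative sketch of contraction + agglomeration (names are hypothetical)."""
    scale = 10 ** precision

    # Contraction: a single pass that projects each point onto a tile
    # and keeps only a per-tile counter, never the points themselves.
    counts = Counter()
    for x, y in points:
        counts[(floor(x * scale), floor(y * scale))] += 1

    # Keep significant tiles only (assumed here: strictly more than tau points).
    significant = {tile for tile, c in counts.items() if c > tau}

    # Agglomeration: group adjacent significant tiles (8-neighborhood, assumed).
    clusters = []
    while significant:
        seed = significant.pop()
        cluster, queue = {seed}, deque([seed])
        while queue:
            tx, ty = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (tx + dx, ty + dy)
                    if nb in significant:
                        significant.remove(nb)
                        cluster.add(nb)
                        queue.append(nb)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters
```

Because only tile counters are stored, memory in this sketch depends on the number of tiles in the grid rather than on the number of input points, which mirrors the constant-memory argument for a fixed grid.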

Paper Structure

This paper contains 31 sections, 3 theorems, 8 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Multiple single-border crossings invariant: Clusters that have neighboring significant tiles in an adjacent slice will be joined, regardless of how many neighboring significant tiles there are across that border and on which side of it they are located.
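
To make the invariant concrete, the following sketch joins tile clusters from two adjacent slices with a small union-find structure. It is a hypothetical illustration, not the paper's joining procedure: the function name `join_across_border`, the representation of clusters as sets of `(x, y)` tiles, and the 8-neighborhood adjacency test are assumptions.

```python
def join_across_border(left, right):
    """Merge clusters of two adjacent slices that touch across the border.

    `left` and `right` are lists of tile sets; this layout is hypothetical.
    """
    clusters = [set(c) for c in left + right]
    parent = list(range(len(clusters)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Map every tile to the index of the cluster that owns it.
    owner = {tile: i for i, c in enumerate(clusters) for tile in c}

    # Any two neighboring tiles owned by different clusters trigger a union,
    # no matter how many such pairs exist or on which side they lie.
    for (tx, ty), i in owner.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                j = owner.get((tx + dx, ty + dy))
                if j is not None:
                    parent[find(i)] = find(j)

    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), set()).update(c)
    return list(merged.values())
```

For example, two separate clusters in the left slice that both touch the same cluster in the right slice end up in one joined cluster, because each crossing contributes a union with the same root.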

Figures (8)

  • Figure 1: This sample illustrates the hub identification problem, where the goal is to find dense clusters in a noisy data set. RASTER identifies two dense clusters and ignores the less dense points in the center.
  • Figure 2: High-level visualization of RASTER (best viewed in color). The original input is shown in a), followed by projection to tiles in b) where only significant tiles are retained. Tile-based clusters are visualized in c), which corresponds to RASTER. Clusters that are returned as collections of points are shown in d), which corresponds to the variant RASTER$'$.
  • Figure 3: RASTER can be effectively parallelized (figure best viewed in color). This figure shows how clusters that are separated by borders can be joined in parallel in $\log n$ steps, where $n$ is the number of initial slices. In Fig. \ref{fig:dc1} there are multiple clusters that are separated by borders, which are successively joined as borders between slices are removed. The corner cases of clusters repeatedly crossing a border (e.g. $c_4, c_5, c_8$ in Fig. \ref{fig:dc1}) and clusters crossing multiple borders (cf. clusters $c_5, c_7, c_9$ in Fig. \ref{fig:dc1}) are shown. As joining clusters is not a bottleneck in our use case, the idea presented here remains unimplemented. However, it may be instrumental when implementing RASTER for large-scale data processing (cf. Sect. \ref{sec:future}). Our code iterates through candidate clusters in slices in a sequential manner from right to left; clusters that do not touch the border of a slice are excluded from this step.
  • Figure 4: RASTER is based on the idea of significant tiles, i.e. tiles that contain more than a threshold number $\tau$ of points. However, as this figure illustrates, the placement of the grid can interfere and separate points that lie in close vicinity. In the given example, when using a threshold of $\tau = 4$, no significant tile would be detected. This is only a theoretical problem for our algorithm and its primary use case, as hubs contain large numbers of points. However, a post-processing step can solve this minor issue at a very modest cost.
  • Figure 5: Scalability of P-RASTER and P-RASTER$'$ when clustering 50M and 500M points (best viewed in color). With $c$ cores, P-RASTER achieves a performance improvement of around $\frac{c}{2}$. The scalability of P-RASTER$'$, on the other hand, is limited, due to cache synchronization and parallel slowdown issues. The latter is, in particular, an issue with very large data sets.
  • ...and 3 more figures
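
The grid-placement effect described in the Figure 4 caption can be reproduced with a few lines of Python. This is a hedged sketch with made-up coordinates: the helper `tile_counts` and its `offset` parameter are hypothetical, and the shifted grid is used only to show that the points are in fact close together. It is not the post-processing step the caption refers to, which is not specified here.

```python
from collections import Counter
from math import floor


def tile_counts(points, offset=0.0):
    """Count points per unit tile; `offset` shifts the grid (illustrative only)."""
    return Counter((floor(x + offset), floor(y + offset)) for x, y in points)


# Eight points packed around the tile corner at (1, 1): illustrative data.
points = [(0.9, 0.9), (0.8, 0.8), (1.1, 0.9), (1.2, 0.8),
          (0.9, 1.1), (0.8, 1.2), (1.1, 1.1), (1.2, 1.2)]

tau = 4
aligned = tile_counts(points)            # points split across four tiles
shifted = tile_counts(points, 0.5)       # same points on a half-tile-shifted grid

# No tile on the original grid exceeds tau, but one shifted tile does.
print(any(c > tau for c in aligned.values()))
print(any(c > tau for c in shifted.values()))
```

On the original grid each of the four tiles around the corner holds two points, so none is significant for $\tau = 4$. On the shifted grid all eight points fall into a single tile.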

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof