Faster Parallel Triangular Maximally Filtered Graphs and Hierarchical Clustering

Steven Raphael, Julian Shun

TL;DR

The paper tackles the scalability of hierarchical clustering via TMFG-DBHT by introducing TMFG construction methods that expose more parallelism. It develops correlation-based TMFG (corr-tmfg) and heap-based TMFG (heap-tmfg), plus ancillary optimizations such as vectorization and approximate all-pairs shortest paths (APSP), to reduce runtime while preserving clustering accuracy. Empirical results on 18 UCR datasets show 3.7–10.7x speedups over the prior state-of-the-art implementation, with up to 34x self-relative speedup on 48 cores and little or no degradation in clustering quality. The proposed methods make accurate TMFG-DBHT clustering practical on large time-series datasets.

Abstract

Filtered graphs provide a powerful tool for data clustering. The triangular maximally filtered graph (TMFG) method, when combined with the directed bubble hierarchy tree (DBHT) method, defines a useful algorithm for hierarchical data clustering. This combined TMFG-DBHT algorithm has been shown to produce clusters with good accuracy for time series data, but the previous state-of-the-art parallel algorithm has limited parallelism. This paper presents an improved parallel algorithm for TMFG-DBHT. Our algorithm increases the amount of parallelism by aggregating the bulk of the work of TMFG construction together to reduce the overheads of parallelism. Furthermore, our TMFG algorithm updates information lazily, which reduces the overall work. We find further speedups by computing all-pairs shortest paths approximately instead of exactly in DBHT. We show experimentally that our algorithm gives a 3.7–10.7x speedup over the previous state-of-the-art TMFG-DBHT implementation, while preserving clustering accuracy.

Paper Structure

This paper contains 12 sections, 7 figures, 1 table, 2 algorithms.

Figures (7)

  • Figure 1: This figure shows an example of the first iteration of the corr-tmfg algorithm. (a) The initial 4-clique in the TMFG. (b) For each face, the closest vertex to each of the face's three vertices is found (the subfigure shows this process for the face $\{1,2,3\}$). (c) Among the three closest vertices, the vertex with maximum gain is selected for that face to form a face-vertex pair (for face $\{1,2,3\}$, this is vertex $6$). (d) Among all face-vertex pairs, the one with maximum gain is added to the TMFG (this is vertex $6$).
  • Figure 2: Parallel runtime of TMFG-DBHT methods on different datasets.
  • Figure 3: Self-relative parallel speedup of opt-tdbht on the three largest datasets for different numbers of cores. "48h" means 48 cores with 2-way hyper-threading.
  • Figure 4: Self-relative parallel speedup of par-tdbht-10 on the three largest datasets for different numbers of cores. "48h" means 48 cores with 2-way hyper-threading.
  • Figure 5: Time breakdown of different algorithms on the Crop dataset, running on 48 cores with hyper-threading (left) and one core (right).
  • ...and 2 more figures
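
The selection steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "closest" means highest similarity in a symmetric matrix `W`, and that the gain of a vertex for a face is the sum of its similarities to the face's three vertices (the usual TMFG gain). The function name `first_corr_tmfg_step` and its arguments are hypothetical.

```python
from itertools import combinations


def first_corr_tmfg_step(W, clique):
    """Pick one face-vertex pair, following steps (b)-(d) of the Figure 1 caption.

    W      : symmetric similarity matrix as a list of lists (higher = closer);
             this interpretation of "closest" is an assumption.
    clique : the four vertices of the initial 4-clique (taken as given here).
    Returns (face, vertex, gain) for the pair with maximum gain.
    """
    n = len(W)
    remaining = [v for v in range(n) if v not in clique]
    # A 4-clique has four triangular faces.
    faces = list(combinations(clique, 3))

    def gain(face, v):
        # Assumed gain: total similarity between v and the face's vertices.
        return sum(W[u][v] for u in face)

    best = None
    for face in faces:
        # (b) For each face vertex, find its closest not-yet-inserted vertex.
        candidates = sorted({max(remaining, key=lambda v: W[u][v]) for u in face})
        # (c) Among those candidates, keep the one with maximum gain.
        vertex = max(candidates, key=lambda v: gain(face, v))
        g = gain(face, vertex)
        # (d) Track the best face-vertex pair over all faces.
        if best is None or g > best[2]:
            best = (face, vertex, g)
    return best
```

Only the three candidates per face are scored here, rather than every remaining vertex. In this sketch the four faces are scanned sequentially; each face's candidate search is independent of the others, so the loop body could in principle be run per face in parallel.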