Faster Parallel Triangular Maximally Filtered Graphs and Hierarchical Clustering
Steven Raphael, Julian Shun
TL;DR
The paper addresses the scalability of hierarchical clustering with TMFG-DBHT by introducing TMFG construction methods that expose more parallelism. It develops a correlation-based TMFG (corr-tmfg) and a heap-based TMFG (heap-tmfg), along with further optimizations such as vectorization and approximate all-pairs shortest paths (APSP), to reduce runtime while preserving clustering accuracy. Experiments on 18 UCR datasets show 3.7–10.7x speedups over the previous state-of-the-art implementation, with up to 34x self-relative speedup on 48 cores and little or no loss in clustering quality. The proposed methods make TMFG-DBHT clustering practical on large time-series datasets.
Abstract
Filtered graphs provide a powerful tool for data clustering. The triangular maximally filtered graph (TMFG) method, when combined with the directed bubble hierarchy tree (DBHT) method, defines a useful algorithm for hierarchical data clustering. This combined TMFG-DBHT algorithm has been shown to produce clusters with good accuracy for time series data, but the previous state-of-the-art parallel algorithm has limited parallelism. This paper presents an improved parallel algorithm for TMFG-DBHT. Our algorithm increases the amount of parallelism by aggregating the bulk of the work of TMFG construction together to reduce the overheads of parallelism. Furthermore, our TMFG algorithm updates information lazily, which reduces the overall work. We find further speedups by computing all-pairs shortest paths approximately instead of exactly in DBHT. We show experimentally that our algorithm gives a 3.7–10.7x speedup over the previous state-of-the-art TMFG-DBHT implementation, while preserving clustering accuracy.
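For readers unfamiliar with TMFG, the following is a minimal sequential sketch of the standard greedy construction: start from a 4-clique, then repeatedly insert the remaining vertex into the triangular face where it gains the most total similarity. It is not the authors' parallel algorithm. The function name `tmfg_edges`, the seed rule (largest row sums), and the heap with lazily refreshed entries are illustrative assumptions, included only to suggest what "updating information lazily" might look like in a simple setting.

```python
import heapq
from itertools import combinations


def tmfg_edges(S):
    """Greedy TMFG sketch on a symmetric similarity matrix S (list of lists), n >= 4.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(S)
    # Seed clique: the four vertices with the largest total similarity
    # (an assumed, commonly used choice).
    row_sums = [sum(row) - row[i] for i, row in enumerate(S)]
    seed = sorted(range(n), key=lambda v: -row_sums[v])[:4]
    edges = {frozenset(e) for e in combinations(seed, 2)}
    remaining = set(range(n)) - set(seed)
    heap = []  # entries: (-gain, vertex, face); one entry per live face

    def push(face):
        # Find the best remaining vertex for this face and record its gain.
        best = max(((sum(S[v][u] for u in face), v) for v in remaining),
                   default=None)
        if best is not None:
            gain, v = best
            heapq.heappush(heap, (-gain, v, face))

    for face in combinations(seed, 3):
        push(face)

    while remaining:
        _, v, face = heapq.heappop(heap)
        if v not in remaining:
            # Stale entry: its vertex was inserted elsewhere. Refresh this
            # face only now (lazily) instead of when v was inserted.
            push(face)
            continue
        # Insert v into the face: add three edges, replace the face by three.
        remaining.remove(v)
        a, b, c = face
        edges.update(frozenset((v, u)) for u in face)
        for new_face in ((a, b, v), (a, c, v), (b, c, v)):
            push(new_face)

    return sorted(tuple(sorted(e)) for e in edges)


# Toy similarity matrix: closer indices are more similar.
S = [[1.0 / (1 + abs(i - j)) for j in range(6)] for i in range(6)]
print(len(tmfg_edges(S)))  # 3*6 - 6 = 12 edges
```

A stale heap entry stores a gain computed over a larger set of remaining vertices, so it is an upper bound on the face's current best gain. An entry that pops as valid is therefore still the global maximum, and the sketch makes the same choices as an eager greedy that refreshes every affected face immediately, up to tie-breaking.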
