Tree-Guided $L_1$-Convex Clustering

Bingyuan Zhang, Yoshikazu Terada

TL;DR

This work addresses the computational bottleneck of obtaining a complete clusterpath in convex clustering for large-scale data by introducing Tree-Guided $L_1$-Convex Clustering (TGCC). TGCC combines tree-structured weights, notably those of a minimum spanning tree (MST), with a dynamic-programming solution to the $L_1$ fused lasso and a tree-based cluster fusion step that prevents cluster splitting and accelerates path construction. The authors extend TGCC to biclustering (bi-TGCC) and sparse clustering (sp-TGCC) using proximal methods and MST-based structures, and report substantial speedups over existing methods. Across synthetic and real datasets, TGCC delivers competitive clustering accuracy while producing complete dendrograms for datasets with up to $10^6$ points on a standard laptop, which makes it practical for scalable hierarchical clustering and exploratory data analysis.
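For orientation, the objective that the summary refers to can be written in the usual convex clustering form. The notation here is an assumption rather than the paper's own: $x_i \in \mathbb{R}^d$ are the data points, $\theta_i$ their fitted centroids, $E$ the edge set of the tree, $w_{ij}$ the edge weights, and $\lambda \ge 0$ the regularization parameter.

$$
\min_{\theta_1,\dots,\theta_n \in \mathbb{R}^d} \; \frac{1}{2}\sum_{i=1}^{n} \lVert x_i - \theta_i \rVert_2^2 \;+\; \lambda \sum_{(i,j)\in E} w_{ij}\, \lVert \theta_i - \theta_j \rVert_1
$$

Because both the squared Euclidean norm and the $L_1$ norm are sums over coordinates, the objective separates into $d$ independent one-dimensional problems:

$$
\sum_{k=1}^{d} \left[ \frac{1}{2}\sum_{i=1}^{n} (x_{ik} - \theta_{ik})^2 \;+\; \lambda \sum_{(i,j)\in E} w_{ij}\, \lvert \theta_{ik} - \theta_{jk} \rvert \right]
$$

Each bracketed term is a fused lasso problem on a tree, which is the kind of problem a dynamic programming approach can solve exactly.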

Abstract

Convex clustering is a modern clustering framework that guarantees globally optimal solutions and performs comparably to other advanced clustering methods. However, obtaining a complete dendrogram (clusterpath) for large-scale datasets remains computationally challenging due to the extensive costs associated with iterative optimization approaches. To address this limitation, we develop a novel convex clustering algorithm called Tree-Guided $L_1$-Convex Clustering (TGCC). We first focus on the fact that the loss function of $L_1$-convex clustering with tree-structured weights can be efficiently optimized using a dynamic programming approach. We then develop an efficient cluster fusion algorithm that utilizes the tree structure of the weights to accelerate the optimization process and eliminate the issue of cluster splits commonly observed in convex clustering. By combining the dynamic programming approach with the cluster fusion algorithm, the TGCC algorithm achieves superior computational efficiency without sacrificing clustering performance. Remarkably, our TGCC algorithm can construct a complete clusterpath for $10^6$ points in $\mathbb{R}^2$ within 15 seconds on a standard laptop without the need for parallel or distributed computing frameworks. Moreover, we extend the TGCC algorithm to develop biclustering and sparse convex clustering algorithms.
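The cluster fusion idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_clusters`, its parameters, and the tolerance `tol` are hypothetical, and the sketch only assumes what the abstract and Figure 2 describe, namely that nodes joined by a tree edge and sharing the same fitted value are merged, and that the merged clusters form a new, smaller tree.

```python
from collections import deque


def fuse_clusters(n, edges, theta, root=0, tol=1e-9):
    """Merge tree neighbours with equal fitted values (illustrative sketch).

    n      -- number of nodes, labelled 0..n-1
    edges  -- list of (u, v) pairs forming a tree
    theta  -- list of tuples, theta[i] is the fitted value of node i
    Returns (labels, new_edges, sizes) describing the contracted tree.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    labels = [-1] * n          # cluster id of each node, -1 = unvisited
    labels[root] = 0
    n_clusters = 1
    new_edges = []             # edges between clusters in the new tree
    queue = deque([root])

    # Breadth-first traversal: each node is compared with its parent only.
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if labels[v] != -1:
                continue       # in a tree, a visited neighbour is the parent
            same = all(abs(a - b) <= tol for a, b in zip(theta[u], theta[v]))
            if same:
                labels[v] = labels[u]          # fuse child into parent's cluster
            else:
                labels[v] = n_clusters         # child starts a new cluster
                new_edges.append((labels[u], labels[v]))
                n_clusters += 1
            queue.append(v)

    sizes = [0] * n_clusters
    for c in labels:
        sizes[c] += 1
    return labels, new_edges, sizes


# Small tree: node 1 is adjacent to nodes 0, 2, 3; node 3 is adjacent to node 4.
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
theta = [(0.0,), (0.0,), (1.0,), (0.0,), (2.0,)]
labels, new_edges, sizes = fuse_clusters(5, edges, theta)
```

In this example nodes 0, 1, and 3 share a fitted value and are connected, so they collapse into one cluster, and the five-node tree shrinks to a three-node tree. Since contracting connected parts of a tree yields another tree, the same procedure can be reapplied at the next value of $\lambda$ on a smaller problem.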

Paper Structure

This paper contains 26 sections, 14 equations, 14 figures, 7 tables, 2 algorithms.

Figures (14)

  • Figure 1: The chaining phenomenon of Single Linkage Clustering (SLC). Left: Two Moons data (TM). Right top: the result of SLC for the data in the left panel. Right bottom: the result of the proposed algorithm for the same data.
  • Figure 2: An illustration of the tree-based cluster fusion algorithm. Left: the original tree structure of nodes. Nodes sharing a red edge have the same $\hat{\boldsymbol{\theta}}$ values. Arrows: the tree traversal order via Breadth-First Search (BFS). Middle: the clusters found by the cluster fusion algorithm. Right: the new tree structure.
  • Figure 3: Synthetic data (Left: Gaussian Mixture 1 (GM1), Middle: Gaussian Mixture 2 (GM2), Right: Two Circles (TC))
  • Figure 4: Runtime comparison of TGCC and other clustering algorithms. Average runtimes over five repetitions.
  • Figure 5: Runtime comparison of TGCC and the naive DP algorithm (Blue: the naive DP algorithm, Red: TGCC; Solid: with 50 values of $\lambda$, Dashed: with 100 values of $\lambda$, Dotted: with 200 values of $\lambda$).
  • ...and 9 more figures