Hierarchical clustering with maximum density paths and mixture models

Martin Ritzert, Polina Turishcheva, Laura Hansel, Paul Wollenhaupt, Marissa A. Weis, Alexander S. Ecker

TL;DR

t-NEB introduces a probabilistically grounded hierarchical clustering framework for high-dimensional data. It overclusters the data with a Student's $t$ mixture model to produce a density landscape, then uses nudged elastic band (NEB) paths to define maximum-density connections between clusters. A bottom-up merging procedure constructs a hierarchy from the initial overclustered partition without re-estimating centers, yielding a dendrogram that reflects multi-scale structure. The approach achieves state-of-the-art or competitive performance on synthetic and real datasets (including MNIST-Nd embeddings and transcriptomic cell types) and provides interpretable hierarchies that reveal fine-grained patterns. By unifying density estimation and merging under a single probabilistic model and avoiding dimensionality reduction, t-NEB offers a robust, scalable tool for exploratory analysis of complex data with ambiguous cluster boundaries.

Abstract

Hierarchical clustering is an effective, interpretable method for analyzing structure in data. It reveals insights at multiple scales without requiring a predefined number of clusters and captures nested patterns and subtle relationships, which are often missed by flat clustering approaches. However, existing hierarchical clustering methods struggle with high-dimensional data, especially when there are no clear density gaps between modes. In this work, we introduce t-NEB, a probabilistically grounded hierarchical clustering method, which yields state-of-the-art clustering performance on naturalistic high-dimensional data. t-NEB consists of three steps: (1) density estimation via overclustering; (2) finding maximum density paths between clusters; (3) creating a hierarchical structure via bottom-up cluster merging. t-NEB uses a probabilistic parametric density model for both overclustering and cluster merging, which yields both high clustering performance and a meaningful hierarchy, making it a valuable tool for exploratory data analysis. Code is available at https://github.com/ecker-lab/tneb clustering.
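To make step (3) concrete, the following is a minimal sketch of bottom-up merging, not the authors' implementation. It assumes that each pair of overclustered components already has a similarity score (the minimum density along the path connecting them) and that merges happen in order of decreasing similarity, in a single-linkage style. The function names `merge_by_path_density` and `n_clusters_at` and the toy similarity values are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_by_path_density(
    n_components: int,
    path_min_density: Dict[Tuple[int, int], float],
) -> List[Tuple[float, int, int]]:
    """Merge components in order of decreasing path similarity.

    Assumption: similarity = minimum density along the path between two
    components. Returns one (threshold, kept_root, absorbed_root) per merge.
    """
    parent = list(range(n_components))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    merges: List[Tuple[float, int, int]] = []
    # Visit the densest connections first.
    ordered = sorted(path_min_density.items(), key=lambda kv: -kv[1])
    for (a, b), density in ordered:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            merges.append((density, root_a, root_b))
    return merges


def n_clusters_at(
    n_components: int,
    merges: List[Tuple[float, int, int]],
    threshold: float,
) -> int:
    """Number of clusters left after applying all merges at or above threshold."""
    return n_components - sum(1 for density, _, _ in merges if density >= threshold)


# Toy example with made-up similarities: two tight pairs, weakly connected.
toy_similarity = {(0, 1): 0.9, (2, 3): 0.8, (1, 2): 0.1, (0, 3): 0.05}
toy_merges = merge_by_path_density(4, toy_similarity)
```

Because the merge list is computed once from the overclustered partition, cutting it at any threshold gives nested clusterings. In the toy example, a threshold between 0.1 and 0.8 leaves two clusters, and the large gap between those two values is the kind of jump that would suggest a meaningful level in the hierarchy.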

Paper Structure

This paper contains 46 sections, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Overview of the t-NEB clustering procedure. A: Illustrative toy dataset with a hierarchical density model consisting of six Gaussians. Shading: probability density. Points: samples. B: Overclustering using a Student's $t$ mixture model with 15 components. Centers are marked by 'x' and colors indicate overclustered assignments. Lines are maximum density paths, with line width indicating the minimum density on the path, which we use as the similarity for cluster merging. C: We iteratively merge clusters starting with the minimum density as threshold. Merges are based solely on the initial overclustered partition, resulting in consistent clusterings at any level of granularity. The threshold jumps at both three and six clusters, indicating meaningful clusterings. D: Dendrogram of the hierarchical merging procedure. The thresholds from C leading to three or six clusters are clearly visible, showing that the algorithm has picked up on the hierarchical nature of the dataset.
  • Figure 2: Density landscape thresholded at different "water levels" leading to 6, 3, and 2 clusters.
  • Figure 3: t-NEB Hierarchical Clustering
  • Figure 4: Optimization of a maximum-density path (yellow line) using the NEB algorithm. The minimum density along this path is our measure of distance between two mixture components. Bottom: Probability density along the NEB path, estimated from the mixture model.
  • Figure 5: Nudged Elastic Band (NEB) Distance
  • ...and 18 more figures
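
The similarity described in the captions of Figures 1 and 4, the minimum density along a path between two mixture components, can be illustrated with a small sketch. This is a simplification under stated assumptions, not the paper's method: it uses an isotropic Gaussian mixture in place of the Student's $t$ mixture, and it evaluates a straight line between two centers rather than a NEB-optimized path. The names `mixture_density` and `min_density_on_line` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (weight, mean, standard deviation) of one isotropic Gaussian component.
# Assumption: Gaussian components stand in for the Student's t components.
Component = Tuple[float, Sequence[float], float]


def mixture_density(x: Sequence[float], components: List[Component]) -> float:
    """Density of an isotropic Gaussian mixture at point x."""
    dim = len(x)
    total = 0.0
    for weight, mean, sigma in components:
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
        norm = (2.0 * math.pi * sigma ** 2) ** (-dim / 2.0)
        total += weight * norm * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total


def min_density_on_line(
    start: Sequence[float],
    end: Sequence[float],
    components: List[Component],
    n_points: int = 51,
) -> float:
    """Minimum mixture density over evenly spaced points on a straight line.

    A NEB-style procedure would bend this path toward high-density regions;
    the straight line is only an unoptimized starting path.
    """
    densities = []
    for k in range(n_points):
        t = k / (n_points - 1)
        point = [s + t * (e - s) for s, e in zip(start, end)]
        densities.append(mixture_density(point, components))
    return min(densities)


# Toy mixture: two equally weighted unit-variance components, 4 units apart.
toy_components: List[Component] = [(0.5, (0.0, 0.0), 1.0), (0.5, (4.0, 0.0), 1.0)]
toy_similarity = min_density_on_line((0.0, 0.0), (4.0, 0.0), toy_components)
```

For this symmetric toy mixture, the lowest density on the line lies at the midpoint between the two centers. An optimized path can only raise this value, so the straight-line result is a lower bound on the maximum-density-path similarity.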