Moving Past Single Metrics: Exploring Short-Text Clustering Across Multiple Resolutions

Justin Miller, Tristram Alexander

TL;DR

This work addresses short-text clustering by moving beyond a single-metric optimum and examining cluster structure across multiple resolutions. It clusters 30,000 Twitter bios using Gaussian Mixture Models on MiniLM embeddings, tracks how bios move between clusters as the number of clusters $K$ increases from 1 to 20, and visualizes the transitions with Sankey diagrams. The authors use Adjusted Mutual Information (AMI), an established measure of agreement between clusterings, alongside a novel Proportional Stability metric to quantify cross-resolution robustness, showing that most clusters subdivide rather than reorganize as $K$ increases. The results support a practical, user-centered view of clustering that prioritizes interpretability and stability over a single optimal solution, with broader implications for multi-resolution validation in short-text data analysis.
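
The multi-resolution sweep described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: it assumes scikit-learn's `GaussianMixture` with default settings (the covariance structure is not stated in this summary), uses synthetic blobs as a stand-in for MiniLM embeddings, and sweeps a small range of $K$ instead of 1 to 20. The helper names `cluster_across_resolutions` and `successive_ami` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.mixture import GaussianMixture


def cluster_across_resolutions(embeddings, k_values, seed=0):
    """Fit one GMM per K and return {K: hard cluster labels}."""
    labels_by_k = {}
    for k in k_values:
        # Default GMM settings are an assumption, not taken from the paper.
        gmm = GaussianMixture(n_components=k, random_state=seed)
        labels_by_k[k] = gmm.fit_predict(embeddings)
    return labels_by_k


def successive_ami(labels_by_k):
    """AMI between the labelling at each K and the next K in the sweep."""
    ks = sorted(labels_by_k)
    return {
        (lo, hi): adjusted_mutual_info_score(labels_by_k[lo], labels_by_k[hi])
        for lo, hi in zip(ks, ks[1:])
    }


# Synthetic stand-in for sentence embeddings (NOT the paper's data).
X, _ = make_blobs(n_samples=600, centers=4, n_features=8, random_state=0)

# K = 1 is a single trivial cluster, so this demo starts at K = 2.
labels_by_k = cluster_across_resolutions(X, k_values=range(2, 7))
ami = successive_ami(labels_by_k)

for (lo, hi), score in ami.items():
    print(f"K={lo} -> K={hi}: AMI={score:.3f}")
```

A high AMI between successive values of $K$ suggests that the finer clustering mostly refines the coarser one, while a drop suggests that items were reshuffled between clusters.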

Abstract

Cluster number is typically a parameter selected at the outset in clustering problems, and while impactful, the choice can often be difficult to justify. Inspired by bioinformatics, this study examines how the nature of clusters varies with cluster number, presenting a method for determining cluster robustness, and providing a systematic method for deciding on the cluster number. The study focuses specifically on short-text clustering, involving 30,000 political Twitter bios, where the sparse co-occurrence of words between texts makes finding meaningful clusters challenging. A metric of proportional stability is introduced to uncover the stability of specific clusters between cluster resolutions, and the results are visualised using Sankey diagrams to provide an interrogative tool for understanding the nature of the dataset. The visualisation provides an intuitive way to track cluster subdivision and reorganisation as cluster number increases, offering insights that static, single-resolution metrics cannot capture. The results show that instead of seeking a single 'optimal' solution, choosing a cluster number involves balancing informativeness and complexity.

Paper Structure

This paper contains 6 sections, 6 equations, 3 figures.

Figures (3)

  • Figure 1: The average AMI (\ref{eq:AMI}) between the original clustering (seed = 0) and 100 iterations of clusterings created using only 80% of the dimensions of the embedding space produced by the MiniLM language model, 80% of the data, and different seeds. The error bars represent the standard deviation of the AMI.
  • Figure 2: The AMI (\ref{eq:AMI}) (blue line, circles) and proportional stability (\ref{eq:Stability}) (green line, squares) between successive clustering levels as $K$ increases. The scattered green circles show the individual proportional stability of each cluster. A proportional stability close to 1 indicates that a cluster largely has a single parent. The red dashed line marks a proportional stability of 0.5; clusters below this line are 'new' in that they are combinations of clusters at the lower resolution.
  • Figure 3: The proportion of bios that can be found in a single cluster at the previous hierarchical clustering level as the number of clusters increases from 1 to 11. Cluster names were created using Google Gemini from a sample of bios from the cluster and its top words. The colour of each cluster is given by its proportional stability: the more yellow (lighter) a cluster is, the greater the proportion of it that came from a single cluster at the previous resolution, whereas the more blue (darker) a cluster is, the more it is made up of a mix of different clusters at the lower resolution.
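
The captions for Figures 2 and 3 suggest one plausible reading of per-cluster proportional stability: for each cluster at the higher resolution, the largest fraction of its members that share a single parent cluster at the previous resolution. The sketch below implements that reading only. The paper's exact equation is not reproduced in this summary, so the function `proportional_stability`, the toy labels, and the use of 0.5 as a cutoff for 'new' clusters are illustrative assumptions.

```python
from collections import Counter, defaultdict


def proportional_stability(parent_labels, child_labels):
    """For each child cluster, the largest share of its members
    that come from one parent cluster (assumed definition)."""
    if len(parent_labels) != len(child_labels):
        raise ValueError("label sequences must have the same length")
    parents_of = defaultdict(Counter)
    for parent, child in zip(parent_labels, child_labels):
        parents_of[child][parent] += 1
    return {
        child: max(counts.values()) / sum(counts.values())
        for child, counts in parents_of.items()
    }


# Toy labels (hypothetical): 3 clusters at K=3, 4 clusters at K=4.
labels_k3 = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
labels_k4 = [0, 0, 0, 3, 1, 1, 1, 3, 2, 2, 2, 3]

stability = proportional_stability(labels_k3, labels_k4)

# Clusters under 0.5 draw from several parents, i.e. they are 'new'.
new_clusters = [c for c, s in stability.items() if s < 0.5]

for child in sorted(stability):
    print(f"cluster {child}: proportional stability = {stability[child]:.2f}")
print("new clusters:", new_clusters)
```

In this toy case, clusters 0, 1 and 2 each inherit all of their members from one parent (stability 1.0), while cluster 3 takes one member from each of the three parents (stability 1/3) and would sit below the dashed line in Figure 2.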