Moving Past Single Metrics: Exploring Short-Text Clustering Across Multiple Resolutions
Justin Miller, Tristram Alexander
TL;DR
This work addresses short-text clustering by moving beyond a single-metric optimum and examining cluster structure across multiple resolutions. It clusters 30,000 Twitter bios using Gaussian Mixture Models on MiniLM embeddings, tracks how bios move between clusters as the number of clusters $K$ increases from 1 to 20, and visualizes the transitions with Sankey diagrams. The authors use Adjusted Mutual Information (AMI) alongside a novel Proportional Stability metric to quantify cross-resolution robustness, showing that most clusters subdivide rather than reorganize as $K$ increases. The results support a practical, user-centered view of clustering that prioritizes interpretability and stability over a single optimal solution, with broader implications for multi-resolution validation in short-text data analysis.
Abstract
The number of clusters is typically a parameter selected at the outset of a clustering problem and, although it strongly affects the result, the choice is often difficult to justify. Inspired by bioinformatics, this study examines how the nature of clusters varies with cluster number, presenting a method for determining cluster robustness and providing a systematic method for deciding on the cluster number. The study focuses specifically on short-text clustering, involving 30,000 political Twitter bios, where the sparse co-occurrence of words between texts makes finding meaningful clusters challenging. A metric of proportional stability is introduced to quantify the stability of specific clusters between cluster resolutions, and the results are visualised using Sankey diagrams to provide an interrogative tool for understanding the nature of the dataset. The visualisation provides an intuitive way to track cluster subdivision and reorganisation as cluster number increases, offering insights that static, single-resolution metrics cannot capture. The results show that instead of seeking a single 'optimal' solution, choosing a cluster number involves balancing informativeness and complexity.
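To make the idea of tracking items between resolutions concrete, the following is a minimal sketch of the bookkeeping behind a Sankey-style view: counting how many items move from each cluster at one value of $K$ to each cluster at the next. The function names, the toy labels, and the "largest child share" quantity are assumptions introduced here for illustration; in particular, the share is not the Proportional Stability metric, whose definition is not given in this summary. In practice the label vectors might come from fitting a Gaussian Mixture Model (for example scikit-learn's `GaussianMixture`) at each $K$.

```python
from collections import Counter, defaultdict


def transition_counts(labels_a, labels_b):
    """Count items moving from each cluster in labels_a to each cluster in labels_b.

    The returned Counter maps (source_cluster, destination_cluster) -> count,
    which is the shape of data a Sankey diagram needs for its link widths.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must cover the same items")
    return Counter(zip(labels_a, labels_b))


def largest_child_share(flows):
    """Fraction of each source cluster that lands in its single largest destination.

    Illustrative only (a hypothetical stand-in): a value of 1.0 means the
    cluster carried over intact; lower values mean it was split or reorganised.
    """
    totals = defaultdict(int)
    largest = defaultdict(int)
    for (src, _dst), n in flows.items():
        totals[src] += n
        largest[src] = max(largest[src], n)
    return {src: largest[src] / totals[src] for src in totals}


# Toy cluster labels for eight texts at K=2 and K=3 (hypothetical data).
labels_k2 = [0, 0, 0, 0, 1, 1, 1, 1]
labels_k3 = [0, 0, 2, 2, 1, 1, 1, 1]

flows = transition_counts(labels_k2, labels_k3)
shares = largest_child_share(flows)
```

In this toy case cluster 1 passes through unchanged while cluster 0 splits evenly into two children, which is the kind of subdivision pattern the Sankey view is meant to expose.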
