Balancing Complexity and Informativeness in LLM-Based Clustering: Finding the Goldilocks Zone

Justin Miller, Tristram Alexander

TL;DR

This study addresses how to balance cluster granularity and interpretability in short-text clustering using LLM-generated cluster names. It applies Gaussian Mixture Models on embeddings produced by an LLM and evaluates cluster quality with semantic density, Adjusted Mutual Information, and accuracy to locate a Goldilocks zone. The results identify an optimal range of $K$ around $16$–$22$ where clusters remain distinct yet interpretable, driven by the semantic alignment between bios and cluster names. The findings offer practical guidance for selecting cluster counts and naming strategies, and highlight the importance of reliability considerations such as locally hosted LLMs for reproducible, interpretable clustering in real-world deployments.

Abstract

The challenge of clustering short text data lies in balancing informativeness with interpretability. Traditional evaluation metrics often overlook this trade-off. Inspired by linguistic principles of communicative efficiency, this paper investigates the optimal number of clusters by quantifying the trade-off between informativeness and cognitive simplicity. We use large language models (LLMs) to generate cluster names and evaluate their effectiveness through semantic density, information theory, and clustering accuracy. Our results show that Gaussian Mixture Model (GMM) clustering on embeddings generated by an LLM increases semantic density compared to random assignment, effectively grouping similar bios. However, as the number of clusters increases, interpretability declines, as measured by a generative LLM's ability to correctly assign bios based on cluster names. A logistic regression analysis confirms that classification accuracy depends on the semantic similarity between bios and their assigned cluster names, as well as their distinction from alternatives. These findings reveal a "Goldilocks zone" where clusters remain distinct yet interpretable. We identify an optimal range of 16–22 clusters, paralleling linguistic efficiency in lexical categorization. These insights inform both theoretical models and practical applications, guiding future research toward optimising cluster interpretability and usefulness.

Paper Structure

This paper contains 14 sections, 13 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Mean semantic density (mean cosine similarity between embeddings of bios within each cluster) as a function of the number of clusters (2–50) for Gaussian Mixture Model (GMM) clustering and randomly assigned clusters. Data points represent averages calculated from up to 10,000 random pairs per cluster, capped at the maximum number of available pairs. Error bars indicate the standard error of the mean semantic density across clusters. GMM clustering results are shown with circles and solid lines, while randomly assigned clusters are shown with squares and dashed lines. The y-axis is capped at 0.4 to focus on the observed range of semantic density.
  • Figure 2: Comparison of Adjusted Mutual Information (AMI) and Accuracy between Gaussian Mixture Model (GMM) clustering and a random baseline. At each clustering level, an LLM was given 1000 random bios together with the cluster names and asked to identify which cluster each bio belonged to. (a) AMI quantifies the agreement between true cluster assignments and predicted cluster assignments, adjusted for chance. Higher values indicate better clustering performance. (b) Accuracy measures the proportion of bios that were correctly assigned to a cluster when given only the cluster names, representing the interpretability of the clustering process. Each data point represents the mean value across experiments for a given number of clusters, with error bars indicating the standard deviation. GMM consistently outperforms the random baseline in both AMI and Simplicity at lower numbers of clusters, but at high numbers of clusters the AMI of GMM and Random is almost the same.
  • Figure 3: Comparison of GMM-based clustering versus randomly assigned clusters, ranked by how many standard deviations GMM’s performance (AMI and Simplicity) deviates from Random. For each number of clusters, we compute the difference between GMM and Random in standard deviation units, then assign a rank (1 = best). Higher rankings indicate that GMM’s performance is consistently farther above Random, whereas lower rankings suggest Random rivaling or exceeding GMM. By visually identifying where rankings are highest for both AMI (blue) and Simplicity (red), one can approximate the optimal number of clusters that best balances complexity and interpretability.
  • Figure 4: This figure illustrates the kernel density estimate (KDE) distributions of cosine similarity differences for bios. The cosine similarity difference is calculated as the difference between the similarity of a bio with its assigned cluster name and the similarity of the same bio with the most similar incorrect cluster name. The blue curve represents the density of bios correctly identified by the LLM, while the orange curve corresponds to bios incorrectly identified. The red line, plotted on a secondary y-axis, shows the proportion of correct cluster name assignments across binned cosine similarity differences (0.01 increments) within the range -0.4 to 0.4, as data outside this range was sparse.
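
The two similarity quantities described in the captions for Figures 1 and 4 can be illustrated with a minimal Python sketch. It is not the authors' code. The function names (`semantic_density`, `similarity_margin`), the seed handling, and the toy two-dimensional vectors are all assumptions made for illustration; only the definitions (mean pairwise cosine similarity over up to 10,000 random pairs per cluster, and similarity to the assigned cluster name minus similarity to the most similar incorrect name) come from the captions.

```python
import itertools
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_density(embeddings, max_pairs=10_000, seed=0):
    """Mean cosine similarity over pairs of embeddings within one cluster.

    All distinct pairs are used when there are at most `max_pairs` of them;
    otherwise `max_pairs` distinct pairs are sampled at random (the sampling
    scheme here is an assumption, not taken from the paper).
    """
    n = len(embeddings)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        raise ValueError("need at least two embeddings in a cluster")
    if total_pairs <= max_pairs:
        pairs = list(itertools.combinations(range(n), 2))
    else:
        rng = random.Random(seed)
        chosen = set()
        while len(chosen) < max_pairs:
            i, j = rng.sample(range(n), 2)
            chosen.add((min(i, j), max(i, j)))
        pairs = sorted(chosen)
    sims = [cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs]
    return sum(sims) / len(sims)


def similarity_margin(bio_vec, name_vecs, assigned_index):
    """Similarity to the assigned cluster name minus the best incorrect one."""
    sims = [cosine_similarity(bio_vec, name) for name in name_vecs]
    best_other = max(s for i, s in enumerate(sims) if i != assigned_index)
    return sims[assigned_index] - best_other


# Toy data (hypothetical): one tight cluster and one spread-out cluster.
tight = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
loose = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
names = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

print(semantic_density(tight), semantic_density(loose))
print(similarity_margin([1.0, 0.0], names, assigned_index=0))
```

On the toy data the tight cluster has a much higher density than the spread-out one, and the margin is positive because the bio vector is closer to its assigned name than to any alternative. A negative margin would correspond to the left-hand side of the x-axis in Figure 4.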