Human-interpretable clustering of short-text using large language models

Justin K. Miller, Tristram J. Alexander

TL;DR

This work tackles short-text clustering by addressing the interpretability gap that plagues automated metrics. It proposes a pipeline where short texts are embedded with a large language model (MiniLM) and clustered using Gaussian Mixture Models, with interpretability validated via human reviewers and an automated LLM (ChatGPT). Results show MiniLM-based clusters are more distinctive and human-interpretable than those from LDA or doc2vec, and ChatGPT can closely mirror human judgments, though biases remain. The study also introduces quantitative interpretability and distinctiveness metrics and argues for including a null model to separate signal from noise. Overall, the approach provides a scalable framework for generating and validating interpretable short-text clusters with potential impact on real-time social media analysis.

Abstract

Clustering short text is a difficult problem, due to the low word co-occurrence between short text documents. This work shows that large language models (LLMs) can overcome the limitations of traditional clustering approaches by generating embeddings that capture the semantic nuances of short text. In this study clusters are found in the embedding space using Gaussian Mixture Modelling (GMM). The resulting clusters are found to be more distinctive and more human-interpretable than clusters produced using the popular methods of doc2vec and Latent Dirichlet Allocation (LDA). The success of the clustering approach is quantified using human reviewers and through the use of a generative LLM. The generative LLM shows good agreement with the human reviewers, and is suggested as a means to bridge the 'validation gap' which often exists between cluster production and cluster interpretation. The comparison between LLM-coding and human-coding reveals intrinsic biases in each, challenging the conventional reliance on human coding as the definitive standard for cluster validation.
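
The embed-then-cluster step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `cluster_embeddings`, the number of components, the full covariance type, and the random seed are all assumptions chosen for illustration, and synthetic vectors stand in for MiniLM embeddings so that the example runs without downloading a model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def cluster_embeddings(embeddings, n_clusters, seed=0):
    """Fit a GMM to row-wise embeddings (hypothetical helper, not from the paper).

    Returns hard cluster labels and the soft membership probabilities.
    """
    gmm = GaussianMixture(
        n_components=n_clusters,   # assumption: the paper's value is not given here
        covariance_type="full",    # assumption: illustrative choice only
        random_state=seed,
    )
    gmm.fit(embeddings)
    labels = gmm.predict(embeddings)
    probs = gmm.predict_proba(embeddings)
    return labels, probs


# Stand-in for sentence embeddings: two well-separated synthetic groups.
# With real text, the rows could instead come from an embedding model, e.g.
#   SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
# (the exact MiniLM variant is an assumption, not stated in the text above).
rng = np.random.default_rng(0)
fake_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 8)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 8)),
])

labels, probs = cluster_embeddings(fake_embeddings, n_clusters=2)
print(labels[:5], labels[-5:])
```

A GMM gives each document a probability of belonging to every cluster, so borderline texts can be identified from `probs` rather than being forced silently into one group.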

Paper Structure (17 sections, 7 equations, 6 figures)

Figures (6)

  • Figure 1: Ordinal regression analysis of clusters created by LDA, doc2vec, and LLM, illustrating the outcomes of an ordinal regression model applied to the ratings of 39 reviewers. The reviewers assessed Twitter bio clusters and were asked to rate the coherence of the clusters across four categories: (a) Confidence in Cluster Name, (b) Coherence of Top Words, (c) Coherence of Sample Bios, and (d) Coherence between the top words and sample bios. Each panel (a-d) represents the probability density function ($\hat{Y}$) of ratings in each category, showcasing the statistical modeling of ordered categorical data. The reviewer ratings of the random clusters are set to 0, so any value greater than 0 indicates better-than-random performance. The vertical lines mark the transitions between values on the Likert scales used by the reviewers. MiniLM is consistently scored at 4 on all measures.
  • Figure 2: The median reviewer score for each cluster across the four categories: (a) Confidence in Cluster Name, (b) Coherence of Top Words, (c) Coherence of Sample Bios, and (d) Coherence between the top words and sample bios. Error bars represent the first and third quartile scores. All clusters with the U designation were unable to be named by the authors of this paper.
  • Figure 3: The number of keywords in each cluster, where keywords are determined by comparing the expected frequency of words in a cluster against their actual frequency, using a Bayes factor greater than 10 to assess whether the difference is statistically significant. The names of each cluster are the names given by the authors, and the prefix U indicates that a cluster could not be named. Colors identify the clustering model used and are consistent with the color code used in Fig. 2. The MiniLM clusters have significantly more keywords than the clusters found using the other methods.
  • Figure 4: Boxplots showing the Spearman rank correlation between the reviewer-provided ratings and the six automated metrics of mean cluster standard deviation, silhouette score, Euclidean distance, number of keywords, CV coherence and UMass coherence (x-axis labels), with reviewer ratings on (a) Confidence in naming; (b) Coherence of top words; (c) Coherence of sample bios; and (d) Coherence between top words and sample bios. Coherence and keywords correlate poorly with reviewer ratings. Mean standard deviation appears to provide the best correlation with the ratings. The large variability in correlation across reviewers is evident, with outliers identified with solid circles: some reviewers correlated well with some measures, while others showed no correlation or negative correlation for some measures.
  • Figure 5: The five words with the highest Consistency ($S$) used by (a) reviewers and (b) ChatGPT to name the clusters created by MiniLM. Along the x-axis are the names given to each cluster by the authors of this paper. The words used to describe the clusters are largely consistent between ChatGPT and the reviewers; however, there are cluster-dependent distinctions revealing human and machine limitations, as discussed in the text.
  • ...and 1 more figure
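
The Spearman rank correlation reported in Figure 4 compares the ordering of clusters by one reviewer's ratings with their ordering by an automated metric. The sketch below shows that calculation using only the standard library. The helper names and the example numbers are hypothetical and are not data from the paper; ties are handled with average ranks, which is a common convention but an assumption here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5


# Hypothetical example: one reviewer's Likert ratings for six clusters and
# an automated metric computed for the same six clusters.
reviewer_ratings = [4, 4, 3, 2, 1, 1]
metric_values = [0.9, 0.7, 0.8, 0.4, 0.2, 0.3]
rho = spearman(reviewer_ratings, metric_values)
print(round(rho, 3))
```

Computing one such coefficient per reviewer and per metric gives the distributions that a boxplot like Figure 4 summarises.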