Summaries as Centroids for Interpretable and Scalable Text Clustering

Jairo Diaz-Rodriguez

TL;DR

The paper tackles interpretable text clustering by replacing numeric centroids in k-means with textual summaries that are re-embedded, preserving the standard objective $\min_{C_1,...,C_k} \sum_{j=1}^k \sum_{i\in C_j} \|\boldsymbol{x}_i - \boldsymbol{\mu}_j\|^2$ between summary steps. It introduces two variants, k-NLPmeans (LLM-free, using classical summarizers) and k-LLMmeans (LLM-assisted with a fixed per-iteration budget), and extends the approach to mini-batch streaming clustering. Through extensive experiments on four static datasets with multiple embeddings and summarizers, the method achieves consistent gains over traditional baselines and approaches the accuracy of recent LLM-based clustering while reducing LLM usage. A sequential streaming case study and a StackExchange-based streaming benchmark demonstrate practical interpretability and scalability for real-time text clustering, with modest costs and robust performance across settings.

Abstract

We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
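The summary-as-centroid idea can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed`, the word-frequency `summarize`, and the function name `summary_kmeans` with its parameters are all hypothetical stand-ins chosen so the example runs with the standard library alone. A deterministic summarizer like this one is closer in spirit to the LLM-free variant; swapping `summarize` for an LLM call would correspond to the LLM-assisted variant.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy embedding (assumption): L2-normalised bag-of-words over a fixed vocab."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(X, centroids):
    """Standard k-means assignment: nearest centroid in embedding space."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in X]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def summarize(texts, top_n=3):
    """Toy deterministic summariser (assumption): most frequent words, ties A-Z."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return " ".join(sorted(counts, key=lambda w: (-counts[w], w))[:top_n])

def summary_kmeans(texts, init_idx, n_rounds=2, kmeans_iters=5):
    vocab = sorted({w for t in texts for w in t.lower().split()})
    X = [embed(t, vocab) for t in texts]
    centroids = [X[i] for i in init_idx]
    summaries = [texts[i] for i in init_idx]
    for _ in range(n_rounds):
        # Ordinary k-means iterations with numeric (mean) centroids.
        for _ in range(kmeans_iters):
            labels = assign(X, centroids)
            for j in range(len(centroids)):
                members = [x for x, l in zip(X, labels) if l == j]
                if members:
                    centroids[j] = mean(members)
        # Summarization step: replace each centroid with an embedded summary.
        labels = assign(X, centroids)
        for j in range(len(centroids)):
            members = [t for t, l in zip(texts, labels) if l == j]
            if members:
                summaries[j] = summarize(members)
                centroids[j] = embed(summaries[j], vocab)
    # Final assignment against the summary centroids.
    return assign(X, centroids), summaries

texts = [
    "cat dog pet", "dog pet food", "cat pet vet",
    "stock market price", "market price trade", "stock trade bank",
]
labels, summaries = summary_kmeans(texts, init_idx=[0, 3])
print(labels, summaries)
```

The returned `summaries` are the readable cluster prototypes. The number of summarizer calls is `n_rounds` times the number of clusters, so in this sketch it does not depend on how many documents are clustered.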

Paper Structure

This paper contains 26 sections, 5 equations, 2 figures, 9 tables, 2 algorithms.

Figures (2)

  • Figure 1: Illustration of k-NLPmeans/k-LLMmeans with a single summarization step. The first panel shows the text embeddings, with stars marking the initial centroids; the second shows the partition reached after k-means iterations (a local minimum); the third shows the summarization step, in which each cluster is summarized into a textual prototype and re-embedded; the final panel runs one more k-means iteration using these summaries as centroids, yielding a qualitatively improved partition.
  • Figure 2: Sequential evolution of the LLM-generated centroids for three primary clusters during the three batches of the sequential mini-batch k-LLMmeans process applied to 2021 posts from the AI Stack Exchange site (StackExchangeData). Main aspects are manually highlighted.