Summaries as Centroids for Interpretable and Scalable Text Clustering
Jairo Diaz-Rodriguez
TL;DR
The paper tackles interpretable text clustering by replacing numeric centroids in k-means with textual summaries that are re-embedded, preserving the standard objective $\min_{C_1,\ldots,C_k} \sum_{j=1}^{k} \sum_{i\in C_j} \|\boldsymbol{x}_i - \boldsymbol{\mu}_j\|^2$ between summary steps. It introduces two variants, k-NLPmeans (LLM-free, using classical summarizers) and k-LLMmeans (LLM-assisted with a fixed per-iteration budget), and extends the approach to mini-batch streaming clustering. Through extensive experiments on four static datasets with multiple embeddings and summarizers, the method achieves consistent gains over traditional baselines and approaches the accuracy of recent LLM-based clustering while reducing LLM usage. A sequential streaming case study and a StackExchange-based streaming benchmark demonstrate practical interpretability and scalability for real-time text clustering, with modest costs and robust performance across settings.
Abstract
We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
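The summary-as-centroid idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embed` (a hashed bag-of-words vector) stands in for a real embedding model, `medoid_summary` (the member text nearest the cluster mean) stands in for a classical or LLM summarizer, and the `summary_every` schedule is an assumed parameter. Assignments stay in embedding space throughout; only the centroid update changes on summary steps.

```python
import zlib
import numpy as np


def toy_embed(texts, dim=64):
    """Hypothetical stand-in for an embedding model: L2-normalised hashed bag-of-words."""
    out = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        for token in text.lower().split():
            out[row, zlib.crc32(token.encode()) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)


def medoid_summary(texts, vectors):
    """Hypothetical extractive summarizer: the member text closest to the cluster mean."""
    mean = vectors.mean(axis=0)
    return texts[int(np.argmin(((vectors - mean) ** 2).sum(axis=1)))]


def assign(X, centroids):
    """Standard k-means assignment: nearest centroid by squared Euclidean distance."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def summary_kmeans(texts, k, n_iter=6, summary_every=2, seed=0):
    """k-means in which every `summary_every`-th update re-embeds a textual summary."""
    rng = np.random.default_rng(seed)
    X = toy_embed(texts)
    centroids = X[rng.choice(len(texts), size=k, replace=False)].copy()
    summaries = [None] * k
    for it in range(1, n_iter + 1):
        labels = assign(X, centroids)
        for j in range(k):
            idx = np.flatnonzero(labels == j)
            if idx.size == 0:
                continue  # keep the previous centroid for an empty cluster
            if it % summary_every == 0:
                # Summary step: the centroid becomes the embedding of a readable summary.
                summaries[j] = medoid_summary([texts[i] for i in idx], X[idx])
                centroids[j] = toy_embed([summaries[j]])[0]
            else:
                # Ordinary step: the centroid is the numeric mean of the members.
                centroids[j] = X[idx].mean(axis=0)
    return assign(X, centroids), summaries, centroids


if __name__ == "__main__":
    docs = [
        "cats purr and meow", "cats chase mice", "dogs bark loudly",
        "dogs fetch sticks", "stocks fell sharply today", "stocks rose on earnings",
    ]
    labels, summaries, _ = summary_kmeans(docs, k=3)
    for j, summary in enumerate(summaries):
        print(j, summary, [d for d, l in zip(docs, labels) if l == j])
```

In this sketch, an LLM-assisted variant would replace `medoid_summary` with a call that summarizes a capped sample of each cluster's texts, so the number of summarizer calls per summary step stays at `k` regardless of how many documents are clustered.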
