Text Clustering as Classification with LLMs

Chen Huang; Guoxiu He

TL;DR

This work tackles unsupervised text clustering without fine-tuned embedding models or traditional clustering algorithms. It reframes clustering as a two-stage classification problem solved through in-context learning with LLMs: the LLM first generates semantically meaningful candidate labels from the data, then assigns each sample to one of these labels, and the assignments form the clusters. Across five diverse datasets, the approach performs comparably to or better than embedding-based baselines while reducing computational overhead and producing human-readable cluster labels. The results suggest that LLMs can simplify text clustering and make its output interpretable. The authors release their code publicly for replication and extension.

Abstract

Text clustering serves as a fundamental technique for organizing and interpreting unstructured textual data, particularly in contexts where manual annotation is prohibitively costly. With the rapid advancement of Large Language Models (LLMs) and their demonstrated effectiveness across a broad spectrum of NLP tasks, an emerging body of research has begun to explore their potential in the domain of text clustering. However, existing LLM-based approaches still rely on fine-tuned embedding models and sophisticated similarity metrics, rendering them computationally intensive and necessitating domain-specific adaptation. To address these limitations, we propose a novel framework that reframes text clustering as a classification task by harnessing the in-context learning capabilities of LLMs. Our framework eliminates the need for fine-tuning embedding models or intricate clustering algorithms. It comprises two key steps: first, the LLM is prompted to generate a set of candidate labels based on the dataset and then merges semantically similar labels; second, it assigns the most appropriate label to each text sample. By leveraging the advanced natural language understanding and generalization capabilities of LLMs, the proposed approach enables effective clustering with minimal human intervention. Experimental results on diverse datasets demonstrate that our framework achieves comparable or superior performance to state-of-the-art embedding-based clustering techniques, while significantly reducing computational complexity and resource requirements. These findings underscore the transformative potential of LLMs in simplifying and enhancing text clustering tasks. We make our code available to the public for utilization at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM. We also provide the supplementary Appendix within the repository.
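
The two steps described in the abstract (label generation with merging, then label assignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, the batch size, and the fallback for unparseable answers are all hypothetical, and `toy_llm` is a keyword-based stand-in so the sketch runs without any API access. The authors' actual prompts and code are in the repository linked above.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to completion text.
LLM = Callable[[str], str]


def generate_labels(texts: List[str], llm: LLM, batch_size: int = 2) -> List[str]:
    """Stage 1a: ask the LLM for candidate labels, one batch of texts at a time."""
    candidates: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        prompt = ("Propose one short topic label per line for these texts:\n"
                  + "\n".join(batch))
        for line in llm(prompt).splitlines():
            label = line.strip()
            if label and label not in candidates:
                candidates.append(label)
    return candidates


def merge_labels(candidates: List[str], llm: LLM) -> List[str]:
    """Stage 1b: ask the LLM to merge semantically similar candidate labels."""
    prompt = ("Merge semantically similar labels; return one label per line:\n"
              + "\n".join(candidates))
    merged = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return merged or candidates  # keep the candidates if the reply is empty


def classify(texts: List[str], labels: List[str], llm: LLM) -> Dict[str, List[str]]:
    """Stage 2: assign each text to one label; texts sharing a label form a cluster."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        prompt = ("Choose the best label for the text.\n"
                  "Labels: " + ", ".join(labels) + "\nText: " + text)
        answer = llm(prompt).strip()
        # Assumption: fall back to the first label when the answer is not a known label.
        label = answer if answer in labels else labels[0]
        clusters[label].append(text)
    return dict(clusters)


def cluster_texts(texts: List[str], llm: LLM) -> Dict[str, List[str]]:
    """Run both stages: generate and merge labels, then classify every text."""
    labels = merge_labels(generate_labels(texts, llm), llm)
    return classify(texts, labels, llm)


def toy_llm(prompt: str) -> str:
    """Keyword-based stand-in for a real LLM call, for demonstration only."""
    def guess(text: str) -> str:
        return "banking" if ("card" in text or "account" in text) else "travel"

    first_line, _, body = prompt.partition("\n")
    if first_line.startswith("Propose"):
        return "\n".join(guess(t) for t in body.splitlines())
    if first_line.startswith("Merge"):
        return body  # nothing to merge in this toy example
    return guess(prompt.rsplit("Text: ", 1)[1])


texts = [
    "my card was declined",
    "book a flight to Paris",
    "close my account",
    "find a hotel",
]
clusters = cluster_texts(texts, toy_llm)
```

In practice `toy_llm` would be replaced by a call to an actual model. Because each cluster is keyed by its generated label, the output carries its own human-readable explanation, and no embedding model or distance metric appears anywhere in the pipeline.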

Paper Structure (31 sections, 9 equations, 3 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Clustering
Adding Explanations to Text Clusters
Text Clustering using LLMs
Methodology
Task Definition
Label Generation Using LLMs
Potential Label Generation
Potential Labels Aggregation and Mergence
Given Label Classification
Experiment
Dataset Description
Implementation Details
Evaluation Metrics
...and 16 more sections

Figures (3)

Figure 1: A comparison between other methods using LLMs (left) and our framework (right) for text clustering. Our framework transforms the clustering task into a text classification task by generating potential labels (Stage 1) and classifying input sentences according to the labels (Stage 2) using LLMs.
Figure 2: Label merging granularity on five datasets. "GT #Clusters" means the ground truth number of clusters in the dataset.
Figure 3: ACC, NMI, and ARI of our framework on five datasets with different percentages of given labels. 0% means no labels are provided to the LLM, 20% means 20% of the gold labels are given to the LLM during label generation, and 100% means the LLM is provided with all true labels and directly performs classification.
