TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs

I-Fan Lin, Faegheh Hasibi, Suzan Verberne

TL;DR

TWIST tackles short-text clustering under realistic constraints: unlabeled data and an unknown cluster count. It builds sparse vectors from representative texts and refines them iteratively with guidance from small LLMs, without requiring labels or prior knowledge of the number of clusters $\hat{K}$. Across diverse datasets and embedders, TWIST matches or surpasses state-of-the-art methods that rely on labeling or fine-tuning, and it scales to large datasets via distillation. This makes it a practical, low-resource option for real-world clustering in dynamic domains such as customer support.

Abstract

In this paper, we propose a training-free and label-free method for short text clustering that can be used on top of any existing embedder. In the context of customer-facing chatbots, companies are dealing with large amounts of user utterances that need to be clustered according to their intent. In these commercial settings, no labeled data is typically available, and the number of clusters is not known. Our method is based on iterative vector updating: it constructs sparse vectors based on representative texts, and then iteratively refines them through LLM guidance. Our method achieves comparable or superior results to state-of-the-art methods that use contrastive learning, but without assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model agnostic and can be applied to any embedder, with relatively small LLMs, and different clustering methods. We also show that our method scales to large datasets, reducing the computational cost of the LLM. These low-resource, adaptable settings and the scalability of our method make it more aligned with real-world scenarios than existing clustering methods.

Paper Structure

This paper contains 43 sections, 1 theorem, 2 equations, 5 figures, 15 tables.

Key Result

Proposition 1

Given the assumption, for any $x_i, x_j \in D$ with $i \neq j$ that belong to the same cluster $S_k$, the corresponding sparse vectors $\mathbf{z}_i$ and $\mathbf{z}_j$ will be identical.
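
The assumption itself is not restated in this summary. As a minimal illustration only, the sketch below assumes a selector that returns the same set of representatives for every text in a cluster, and assumes that a sparse vector simply marks the selected representatives. Under those two assumptions, identical vectors follow directly. The names `sparse_vector` and `oracle_selection` are hypothetical and do not come from the paper.

```python
def sparse_vector(selected, d):
    """Length-d vector with ones at the selected representative indices (assumed form)."""
    vec = [0.0] * d
    for idx in selected:
        vec[idx] = 1.0
    return vec

# Hypothetical consistent selector: texts sharing an intent get the same representatives.
oracle_selection = {
    "reset my password": {0},
    "forgot password, help": {0},
    "cancel my order": {2},
}

d = 4  # number of representative texts in this toy example
vectors = {text: sparse_vector(sel, d) for text, sel in oracle_selection.items()}

same_cluster_identical = vectors["reset my password"] == vectors["forgot password, help"]
different_cluster_differs = vectors["reset my password"] != vectors["cancel my order"]
```

Once same-cluster texts share one vector, any downstream clustering method that groups identical points together recovers the cluster.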

Figures (5)

  • Figure 1: Construction of sparse vectors. Initial stage: We first encode each text in $D$ using a pretrained embedder and partition the texts into $d$ clusters. We then select the medoid of each cluster as the representative text and construct one-hot vectors for these representatives. The remaining text vectors are obtained based on LLM selection. Iterative stage: We update all of the vectors iteratively until convergence.
  • Figure 2: ACC and NMI across all test datasets (Emerging Data and GCD) for three LLMs. Results for K-means are averaged over 5 runs, while HDBSCAN is shown from a single run.
  • Figure 3: Comparison of (a) varying neighbor $m$ and (b) iterative updates on ACC and NMI across all test datasets (Emerging Data and GCD).
  • Figure 4: The average ACC and NMI across two datasets for different $d$. Results for K-means are averaged over 5 runs, while HDBSCAN is shown from a single run. $4096$ here means $\min\{4096, N\}$ because the dataset size can be smaller.
  • Figure 5: Ratio of texts converged at iteration $t$. All-MiniLM-L6-v2 (as embedder) and Qwen3-8B (as LLM) were used.
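
The two stages described in the Figure 1 caption can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the authors' implementation. The `select` callback is a hypothetical stand-in for the LLM selection step, and the update rule (averaging the vectors of the selected texts) and the stopping test (no vector changes between iterations) are assumptions made for this example. The initial partition is passed in as `labels` and could come from any clustering method.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def medoid(members, embeddings):
    # Member with the smallest total distance to the other members.
    return min(members, key=lambda i: sum(euclidean(embeddings[i], embeddings[j]) for j in members))

def mean_vector(rows, d):
    # Element-wise mean; an empty selection yields the zero vector.
    if not rows:
        return [0.0] * d
    return [sum(col) / len(rows) for col in zip(*rows)]

def initial_stage(embeddings, labels, select):
    """One-hot vectors for medoids; other texts are built from `select` (LLM stand-in)."""
    cluster_ids = sorted(set(labels))
    d = len(cluster_ids)
    reps = [medoid([i for i, c in enumerate(labels) if c == cid], embeddings)
            for cid in cluster_ids]
    vectors = {rep: [1.0 if k == pos else 0.0 for k in range(d)]
               for pos, rep in enumerate(reps)}
    for i in range(len(embeddings)):
        if i not in vectors:
            vectors[i] = mean_vector([vectors[r] for r in select(i, reps)], d)
    return reps, vectors

def iterative_stage(vectors, select, max_iters=10):
    """Recompute every vector from its selected texts until nothing changes (assumed rule)."""
    d = len(next(iter(vectors.values())))
    candidates = sorted(vectors)
    for _ in range(max_iters):
        updated = {i: mean_vector([vectors[j] for j in select(i, candidates)], d)
                   for i in vectors}
        if updated == vectors:
            break
        vectors = updated
    return vectors

# Toy data: two well-separated groups of 2-D "embeddings".
embeddings = [(0, 0), (0, 1), (0, 2), (10, 0), (10, 1), (10, 2)]
labels = [0, 0, 0, 1, 1, 1]

def toy_select(i, candidates):
    # Distance threshold used only as a deterministic placeholder for an LLM call.
    return [j for j in candidates if euclidean(embeddings[i], embeddings[j]) < 5]

reps, vectors = initial_stage(embeddings, labels, toy_select)
final = iterative_stage(vectors, toy_select)
```

In a real pipeline, `toy_select` would be replaced by a prompt that asks an LLM which candidate texts share the intent of text `i`, and the resulting vectors would be passed to K-means or HDBSCAN.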

Theorems & Definitions (1)

  • Proposition 1