Large Language Models Enable Few-Shot Clustering

Vijay Viswanathan, Kiril Gashteovski, Carolin Lawrence, Tongshuang Wu, Graham Neubig

TL;DR

The paper investigates whether large language models can amplify expert guidance for few-shot semi-supervised text clustering. It proposes three practical integration points: enriching input representations via keyphrase expansion, supplying pseudo-oracle pairwise constraints during clustering, and applying LLM-guided post-hoc corrections. Across entity canonicalization and text clustering tasks on multiple datasets, input expansion consistently improves cluster quality; pseudo-oracle constraints offer strong gains when enough queries are available, while post-correction yields limited benefits. The results suggest a cost-effective, plug-in approach for improving clustering performance without requiring extensive human feedback, and the authors release code and prompts for public use.
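
As a rough illustration of the first integration point, the sketch below concatenates a document embedding with an embedding of its keyphrases. It is not the authors' implementation: `toy_embed` is a hypothetical stand-in for a real text encoder, the keyphrases are hard-coded where an LLM would generate them, and joining the keyphrases into a single string before embedding is an assumption.

```python
import numpy as np


def toy_embed(text: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def expand_representation(document, keyphrases, embed=toy_embed):
    """Concatenate the document embedding with a keyphrase embedding."""
    doc_vec = embed(document)
    # Assumption: all keyphrases are joined and embedded as one string.
    key_vec = embed(" ".join(keyphrases))
    return np.concatenate([doc_vec, key_vec])


doc = "how do i reset my card pin"
keyphrases = ["change PIN", "card security"]  # hypothetical LLM output
features = expand_representation(doc, keyphrases)
print(features.shape)  # (16,)
```

The expanded vectors can then be passed to any standard clusterer in place of the original document embeddings.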

Abstract

Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs post-correction). We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.
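
The "during clustering" stage can be pictured as turning pairwise oracle answers into must-link and cannot-link constraints. The sketch below is a minimal illustration under stated assumptions: `toy_oracle` is a hypothetical stand-in for an LLM prompted with a few expert-labeled examples, and the exhaustive choice of candidate pairs is for demonstration only. How the paper selects pairs and feeds constraints to the clusterer is not shown here.

```python
from itertools import combinations


def collect_constraints(texts, candidate_pairs, oracle):
    """Ask the oracle about each pair and sort the answers into constraints."""
    must_link, cannot_link = [], []
    for i, j in candidate_pairs:
        if oracle(texts[i], texts[j]):
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


def toy_oracle(a, b):
    """Hypothetical stand-in for an LLM call: link texts that share a word."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


texts = [
    "card payment declined",
    "declined card at checkout",
    "update home address",
]
# Assumption: query every pair; a real system would budget its queries.
pairs = list(combinations(range(len(texts)), 2))
must_link, cannot_link = collect_constraints(texts, pairs, toy_oracle)
print(must_link)    # [(0, 1)]
print(cannot_link)  # [(0, 2), (1, 2)]
```

The resulting lists have the shape expected by constrained clustering algorithms such as pairwise constraint K-Means, which the paper's Figure 5 refers to.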

Paper Structure (31 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: In traditional semi-supervised clustering, a user provides a large amount of feedback to the clusterer. In our approach, the user prompts an LLM with a small amount of feedback. The LLM then generates a large amount of pseudo-feedback for the clusterer.
  • Figure 2: We expand document representations by concatenating them with keyphrase embeddings. The keyphrases are generated by a large language model.
  • Figure 3: After performing clustering, we identify low-confidence points. For these points, we ask an LLM whether the current cluster assignment is correct. If the LLM responds negatively, we ask the LLM whether this point should instead be linked to any of the top-5 nearest clusters, and correct the clustering accordingly.
  • Figure 4: Using the CMVC architecture, we encode a knowledge graph-based "fact view" and a text-based "context view" to represent each entity.
  • Figure 5: Collecting more pseudo-oracle feedback for pairwise constraint K-Means on OPIEC59k improves the Macro F1 metric without reducing other metrics. Compared to the same algorithm with true oracle constraints, we see the sensitivity of this algorithm to a noisy oracle.
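
The post-correction loop described in Figure 3 can be sketched as follows. This is an illustrative sketch, not the authors' code: the confidence measure (the margin between the nearest and second-nearest centroid distances), the parameter `n_check`, the exclusion of the current cluster from the candidates, and the callbacks `confirm` and `pick` (hypothetical stand-ins for the two LLM questions) are all assumptions. Only the top-5 candidate limit comes from the caption.

```python
import numpy as np


def post_correct(X, labels, centroids, confirm, pick, n_check=2, top_k=5):
    """Re-examine the least confident points and reassign them if advised."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    sorted_d = np.sort(dists, axis=1)
    # Assumption: a small gap between the two nearest centroids = low confidence.
    margin = sorted_d[:, 1] - sorted_d[:, 0]
    low_conf = np.argsort(margin)[:n_check]

    new_labels = labels.copy()
    for i in low_conf:
        i, current = int(i), int(labels[i])
        if confirm(i, current):  # question 1: is the assignment correct?
            continue
        order = [int(c) for c in np.argsort(dists[i])]
        nearest = [c for c in order if c != current][:top_k]
        choice = pick(i, nearest)  # question 2: which nearby cluster, if any?
        if choice is not None:
            new_labels[i] = choice
    return new_labels


X = np.array([[0.0, 0.0], [0.9, 0.0], [2.0, 0.0], [1.1, 0.0]])
centroids = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = np.array([0, 1, 1, 0])  # points 1 and 3 start in the wrong cluster
truth = [0, 0, 1, 1]             # used only by the toy callbacks below

new_labels = post_correct(
    X, labels, centroids,
    confirm=lambda i, c: truth[i] == c,
    pick=lambda i, cands: truth[i] if truth[i] in cands else None,
)
print(new_labels.tolist())  # [0, 0, 1, 1]
```

In practice the two callbacks would wrap LLM prompts, so the number of points checked directly controls the query cost.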