Context-Aware Clustering using Large Language Models

Sindhu Tipirneni, Ravinarayana Adkathimar, Nurendra Choudhary, Gaurush Hiranandani, Rana Ali Amjad, Vassilis N. Ioannidis, Changhe Yuan, Chandan K. Reddy

TL;DR

The paper tackles supervised clustering of text-based entity subsets by leveraging context within the subset. It proposes CACTUS, which finetunes an open-source LLM using Scalable Inter-entity Attention (SIA), an augmented triplet loss with a neutral node, and a self-supervised clustering task to transfer clustering knowledge from a closed-source LLM to a scalable model. Experiments on e-commerce datasets show CACTUS outperforms both unsupervised baselines and prior supervised methods across standard external clustering metrics. The approach enables scalable, cost-effective clustering of entity subsets in real-world applications like product queries and recommendations.

Abstract

Despite the remarkable success of Large Language Models (LLMs) in text understanding and generation, their potential for text clustering tasks remains underexplored. We observed that powerful closed-source LLMs provide good quality clusterings of entity sets but are not scalable due to the massive compute power required and the associated costs. Thus, we propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS), a systematic approach that leverages open-source LLMs for efficient and effective supervised clustering of entity subsets, particularly focusing on text-based entities. Existing text clustering methods fail to effectively capture the context provided by the entity subset. Moreover, though there are several language modeling based approaches for clustering, very few are designed for the task of supervised clustering. This paper introduces a novel approach towards clustering entity subsets using LLMs by capturing context via a scalable inter-entity attention mechanism. We propose a novel augmented triplet loss function tailored for supervised clustering, which addresses the inherent challenges of directly applying the triplet loss to this problem. Furthermore, we introduce a self-supervised clustering task based on text augmentation techniques to improve the generalization of our model. For evaluation, we collect ground truth clusterings from a closed-source LLM and transfer this knowledge to an open-source LLM under the supervised clustering framework, allowing a faster and cheaper open-source model to perform the same task. Experiments on various e-commerce query and product clustering datasets demonstrate that our proposed approach significantly outperforms existing unsupervised and supervised baselines under various external clustering evaluation metrics.
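To make the "inherent challenges of directly applying the triplet loss" concrete, here is a minimal sketch of the standard hinge triplet loss computed on pairwise similarities. The similarity values, the helper names (`triplet_loss`, `set_sim`), and the averaging over triplets are illustrative assumptions, not taken from the paper; the paper's augmented loss with a neutral node is not reproduced here. The toy values are chosen so that the loss is zero for margin 0.3 even though one intra-cluster pair is less similar than one inter-cluster pair.

```python
from itertools import permutations


def triplet_loss(sim, labels, margin=0.3):
    """Mean hinge triplet loss over all (anchor, positive, negative) triplets.

    sim[i][j] is the similarity between entities i and j;
    labels[i] is the ground-truth cluster id of entity i.
    """
    losses = []
    for a, p, neg in permutations(range(len(labels)), 3):
        if labels[a] == labels[p] and labels[a] != labels[neg]:
            losses.append(max(0.0, sim[a][neg] - sim[a][p] + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical subset: 3 clusters with 2 entities each (made-up similarities).
labels = [0, 0, 1, 1, 2, 2]
n = len(labels)
sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def set_sim(i, j, value):
    """Set a symmetric similarity entry."""
    sim[i][j] = sim[j][i] = value


set_sim(0, 1, 0.4)  # weak intra-cluster pair
set_sim(2, 3, 0.9)  # strong intra-cluster pair
set_sim(4, 5, 0.9)  # strong intra-cluster pair
set_sim(2, 4, 0.5)  # inter-cluster pair that beats the weak intra-cluster pair

loss = triplet_loss(sim, labels, margin=0.3)

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
weakest_intra = min(sim[i][j] for i, j in pairs if labels[i] == labels[j])
strongest_inter = max(sim[i][j] for i, j in pairs if labels[i] != labels[j])
```

In this toy case every triplet already satisfies the margin relative to its own anchor, so the loss gives no further training signal, yet no single similarity threshold can link entities 0 and 1 without also linking entities 2 and 4.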

Paper Structure (19 sections, 5 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of the supervised clustering problem: Each training sample contains a subset of entities along with the corresponding ground truth clustering. Given a test sample, which is an unseen entity subset, the goal is to cluster the entities in the test sample. In a sample, color denotes a cluster, and shape denotes an entity.
  • Figure 2: Overview of CACTUS: The entities in the input subset are tokenized and passed through $\text{LLM}_\text{o}$, where the self-attention layers are modified with scalable inter-entity attention (SIA) to obtain context-aware entity embeddings. Pairwise cosine similarities are used for computing loss and predicted clusterings.
  • Figure 3: Example of an entity subset with 3 clusters containing 2 entities each. There exists an intra-cluster (yellow) edge with similarity less than some inter-cluster (green-blue) edges. For margin=0.3, the triplet loss (eq. \ref{eq:triplet}) is at its minimum while the proposed augmented triplet loss (eq. \ref{eq:aug_triplet}) is not.
  • Figure 4: GPU memory usage for inference using NIA, SIA (hid-mean), and FIA methods.
  • Figure 5: Case Study: Predicted clusterings with pairwise similarities using SIA and NIA methods. The SIA method correctly identifies the common cluster membership of the first two entities where NIA fails. The stopping threshold for agglomerative clustering is chosen based on the results of the validation set.
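The prediction step described in the captions of Figures 2 and 5 (pairwise cosine similarities followed by agglomerative clustering with a stopping threshold) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the 2-dimensional embeddings are made up, average linkage is assumed because the linkage criterion is not given in this section, and the threshold is hard-coded rather than tuned on a validation set.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return the matrix of pairwise cosine similarities between embedding vectors."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    n = len(embeddings)
    return [
        [
            sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def agglomerative_clusters(sim, threshold):
    """Greedy agglomerative clustering on a similarity matrix.

    Assumes average linkage. Repeatedly merges the most similar pair of
    clusters and stops once the best similarity falls below `threshold`.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(c1, c2):
        # Average similarity over all cross-cluster entity pairs.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, bi, bj = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]
    return sorted(sorted(c) for c in clusters)


# Hypothetical context-aware embeddings for a subset of 4 entities.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
similarities = cosine_similarity_matrix(embeddings)
predicted = agglomerative_clusters(similarities, threshold=0.5)
```

With these made-up embeddings, a threshold of 0.5 groups entities 0 and 1 together and entities 2 and 3 together. Lowering the threshold far enough merges everything into one cluster, which is why the stopping threshold has to be selected with care.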