ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation

Fillipe dos Santos Silva, Gabriel Kenzo Kakimoto, Julio Cesar dos Reis, Marcelo S. Reis

TL;DR

ERASMO addresses the challenge of clustering heterogeneous tabular data by fine-tuning a transformer-based language model on textually encoded tabular representations and then generating embeddings for clustering. The method introduces a textual converter, random feature sequence shuffling, and optional number verbalization to produce context-rich embeddings in two variants, ERASMO_base (without number verbalization) and ERASMO_NV (with it). Across five real-world datasets and multiple clustering algorithms, ERASMO achieves state-of-the-art performance on internal metrics (Silhouette Score, SS; Calinski-Harabasz Index, CHI; Davies-Bouldin Index, DBI), which the authors attribute to dataset-specific contextual embeddings that capture complex patterns in tabular data. The paper also discusses practical considerations, including computational cost and metric limitations, and positions ERASMO as a promising foundation for clustering tools and downstream tasks such as retrieval-augmented systems.

Abstract

Cluster analysis plays a crucial role in various domains and applications, such as customer segmentation in marketing. These contexts often involve multimodal data, including both tabular and textual datasets, making it challenging to represent hidden patterns for obtaining meaningful clusters. This study introduces ERASMO, a framework designed to fine-tune a pretrained language model on textually encoded tabular data and generate embeddings from the fine-tuned model. ERASMO employs a textual converter to transform tabular data into a textual format, enabling the language model to process and understand the data more effectively. Additionally, ERASMO produces contextually rich and structurally representative embeddings through techniques such as random feature sequence shuffling and number verbalization. Extensive experimental evaluations were conducted using multiple datasets and baseline approaches. Our results demonstrate that ERASMO fully leverages the specific context of each tabular dataset, leading to more precise and nuanced embeddings for accurate clustering. This approach enhances clustering performance by capturing complex relationship patterns within diverse tabular data.

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: The ERASMO data pipeline for the fine-tuning phase. First, a textual converter step transforms tabular data into meaningful text (1). Next, a random feature order permutation step is applied (2). Then, based on user choice, the pipeline diverges: it can proceed directly to fine-tuning an LLM (3a) to generate ERASMO_base, or apply a number verbalizer (3b) before fine-tuning the LLM (4b) to generate ERASMO_NV.
  • Figure 2: The ERASMO pipeline for generating embeddings and cluster analysis. The input test tabular data is first transformed into text sequences (1). Next, a random feature order permutation step is applied (2). For ERASMO_NV, a number verbalizer step follows (3) before the fine-tuned LLM generates embeddings (4). For ERASMO_base, the pipeline goes directly from step (2) to step (4). The embeddings are subsequently used for cluster analysis.
  • Figure 3: t-SNE visualization of embedding representations on the Yelp dataset for different models: (a) MPNet-v2, (b) OpenAI, (c) LLaMA-2, (d) Falcon, (e) GPT2 Medium, (f) PMV2 + DICE, (g) ERASMO_base, and (h) ERASMO_NV.
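
Figure 2 ends with the embeddings being passed to cluster analysis, which the paper scores with internal metrics such as the Silhouette Score. As a rough illustration of what that score measures, the sketch below computes it in plain Python on made-up 2-D points standing in for embeddings. The function name, the toy points, and both labelings are assumptions for illustration only; this is not the authors' code or data.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point: a = mean distance to the other members of its cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical stand-ins for embeddings: two tight, well-separated groups.
toy_embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# A labeling that matches the groups versus one that cuts across them.
print(silhouette_score(toy_embeddings, [0, 0, 1, 1]))
print(silhouette_score(toy_embeddings, [0, 1, 0, 1]))
```

The labeling that follows the two groups scores close to 1, while the one that cuts across them scores below 0, which is why a higher Silhouette Score is read as better-separated clusters.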

Theorems & Definitions (3)

  • Definition: Textual Converter
  • Definition: Random Feature Sequence Shuffle
  • Definition: Number Verbalizer
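
The three definitions listed above can be pictured with a minimal sketch. This page gives only their names, so everything below is an assumption: the "feature is value" clause format, the comma separator, the helper names, and the restriction of verbalization to integers below one million are hypothetical choices made for illustration, not the paper's formal definitions.

```python
import random

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def verbalize_int(n):
    """Number verbalizer (assumed form): spell out an integer with |n| < 1,000,000."""
    if n < 0:
        return "minus " + verbalize_int(-n)
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + verbalize_int(rest) if rest else "")
    if n < 1_000_000:
        thousands, rest = divmod(n, 1000)
        return verbalize_int(thousands) + " thousand" + (" " + verbalize_int(rest) if rest else "")
    raise ValueError("this sketch only handles |n| < 1,000,000")


def to_clauses(row):
    """Textual converter (assumed form): one 'feature is value' clause per column."""
    return [f"{name} is {value}" for name, value in row.items()]


def shuffle_clauses(clauses, rng):
    """Random feature sequence shuffle: permute clause order, leaving the input untouched."""
    shuffled = list(clauses)
    rng.shuffle(shuffled)
    return shuffled


def encode_row(row, rng, verbalize=False):
    """Encode one table row as text; verbalize=True spells out integer values first."""
    if verbalize:
        row = {
            name: verbalize_int(value)
            if isinstance(value, int) and not isinstance(value, bool)
            else value
            for name, value in row.items()
        }
    return ", ".join(shuffle_clauses(to_clauses(row), rng))


# Hypothetical row; the column names and values are made up.
example_row = {"age": 42, "city": "Campinas", "visits": 7}
rng = random.Random(0)
print(encode_row(example_row, rng))                  # digits kept as-is
print(encode_row(example_row, rng, verbalize=True))  # integers spelled out
```

Shuffling the clause order on every call means the same row can appear with its features in different positions, so a model trained on such text is not tied to one fixed column order.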