Large Language Models for Imbalanced Classification: Diversity makes the difference

Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Taylor Braund, Alexis Whitton, Svetha Venkatesh

TL;DR

This work tackles imbalanced classification on tabular data by using large language models to generate diverse minority samples. It introduces ImbLLM, which combines (i) sampling conditioned on both minority labels and feature-value pairs, (ii) a fixed-Y permutation to enhance token interactions, and (iii) fine-tuning on minority samples plus interpolated samples to broaden variability. Using an entropy-based analysis, the authors prove that their design encourages diversity in the generated samples, and they report state-of-the-art or near-state-of-the-art performance on 10 real-world datasets against eight baselines, with improved sample quality and coverage. The approach offers a principled, scalable path to robust minority-class data augmentation in tabular domains, reducing reliance on verification steps and improving generalization to unseen data.

Abstract

Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting categorical variables into numerical vectors, which often leads to information loss. Recently, large language model (LLM)-based methods have been introduced to overcome this limitation. However, current LLM-based approaches typically generate minority samples with limited diversity, reducing robustness and generalizability in downstream classification tasks. To address this gap, we propose a novel LLM-based oversampling method designed to enhance diversity. First, we introduce a sampling strategy that conditions synthetic sample generation on both minority labels and features. Second, we develop a new permutation strategy for fine-tuning pre-trained LLMs. Third, we fine-tune the LLM not only on minority samples but also on interpolated samples to further enrich variability. Extensive experiments on 10 tabular datasets demonstrate that our method significantly outperforms eight SOTA baselines. The generated synthetic samples are both realistic and diverse. Moreover, we provide theoretical analysis through an entropy-based perspective, proving that our method encourages diversity in the generated samples.
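
The first component, conditioning generation on both the minority label and features, can be sketched as a prompt builder. This is a minimal illustration, not the authors' implementation: the function name `build_prompt`, the comma separator, the trailing comma, and the choice to copy feature values from one randomly chosen minority row are all assumptions. Only the "`X is v`" clause format and the idea of concatenating random feature-value pairs to the label condition come from the source.

```python
import random


def build_prompt(label_name, minority_label, minority_rows, n_features=1, rng=None):
    """Build a generation prompt from the minority label plus random feature-value pairs.

    Hypothetical helper: feature values are copied from one randomly chosen
    minority row (an assumption), so every pair in the prompt was actually observed.
    """
    rng = rng or random.Random()
    row = rng.choice(minority_rows)
    # Pick which features to expose in the prompt.
    cols = rng.sample(sorted(row), k=min(n_features, len(row)))
    parts = [f"{label_name} is {minority_label}"]  # fixed condition: "Y is y_minor"
    parts += [f"{c} is {row[c]}" for c in cols]    # random conditions: "X_i is v_i"
    # The trailing comma invites the model to continue with the remaining features.
    return ", ".join(parts) + ","


if __name__ == "__main__":
    rows = [
        {"age": 39, "workclass": "Private"},
        {"age": 52, "workclass": "Self-emp"},
    ]
    print(build_prompt("income", ">50K", rows, n_features=1, rng=random.Random(0)))
```

Because the feature-value pairs change from call to call, repeated prompts start the model from different points of the minority distribution instead of always from the bare label.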

Paper Structure

This paper contains 35 sections, 3 theorems, 26 equations, 7 figures, 5 tables.

Key Result

Proposition 1

Let $C$ be the fixed condition "$Y\text{ is }y_{minor}$". Let $R_{i}$ be a random combination of a feature and its value "$X_{i}\text{ is }v_{i}$". Injecting randomness into the condition $C$ by concatenating random combinations $\{R_{i}\}_{i=1}^{n}$ increases the diversity of the set of generated s

Figures (7)

  • Figure 1: Training and evaluation of an oversampling method. Training: the oversampling method learns from the imbalanced dataset ${\cal D}_{train}$ to generate a synthetic minority dataset $\hat{{\cal D}}_{minor}$, which is used to create the rebalanced dataset $\hat{{\cal D}}_{train}$. Evaluation: $\hat{{\cal D}}_{train}$ is used to train an ML classifier (e.g. XGBoost). The classifier is evaluated on a held-out test set ${\cal D}_{test}$ to compute F1-score and AUC. A better score implies a better oversampling method.
  • Figure 2: Attention matrix. Existing LLM-based methods use permute_xy to shuffle both $X$ and $Y$, leading to no attention scores between $y_{minor}$ and some features (a). Our method uses fix_y to permute only $X$ while fixing $Y$ at the beginning, resulting in attention scores between $y_{minor}$ and all features (b).
  • Figure 3: Overview of our ImbLLM. First step: it converts each row to a sentence (a), permutes $X$ and fixes $Y$ at the beginning (b), and fine-tunes an LLM with minority samples ${\cal D}_{minor}$ and interpolated samples ${\cal D}_{inter}$ (c). Second step: it constructs prompts based on both minority label and features to query the fine-tuned LLM to generate minority samples $\hat{x}$.
  • Figure 4: The average quality and diversity scores. $\uparrow$ means "higher is better".
  • Figure 5: F1-score vs. interpolation-ratio over 10 datasets.
  • ...and 2 more figures
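
The two permutation strategies named in the Figure 2 and Figure 3 captions can be sketched in a few lines. This is an illustrative reading of the captions, not the authors' code: the names `permute_xy` and `fix_y` come from the captions, while `row_to_clauses`, the comma-separated sentence format, and the example column names are assumptions. Assuming a causal (left-to-right) LLM, placing the label clause first means every later feature token can attend to it, which matches the caption's description of panel (b).

```python
import random


def row_to_clauses(row):
    """Turn a row (dict of column -> value) into "column is value" clauses."""
    return [f"{col} is {val}" for col, val in row.items()]


def permute_xy(features, label_name, label, rng):
    """Shuffle feature clauses and the label clause together."""
    clauses = row_to_clauses({**features, label_name: label})
    rng.shuffle(clauses)
    return ", ".join(clauses)


def fix_y(features, label_name, label, rng):
    """Shuffle only the feature clauses; keep the label clause at the beginning."""
    x_clauses = row_to_clauses(features)
    rng.shuffle(x_clauses)
    return ", ".join([f"{label_name} is {label}"] + x_clauses)


if __name__ == "__main__":
    rng = random.Random(0)
    feats = {"age": 39, "workclass": "Private", "education": "Bachelors"}
    print("permute_xy:", permute_xy(feats, "income", ">50K", rng))
    print("fix_y:     ", fix_y(feats, "income", ">50K", rng))
```

Under `permute_xy` the label can land anywhere in the sentence, so features placed before it never see it during training; `fix_y` removes that case while still varying the feature order.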

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Proposition 3
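
The propositions are stated in terms of entropy-based diversity. As a rough aid to intuition only, the sketch below computes the empirical Shannon entropy of a set of generated samples, one simple way to quantify how varied a set is. It does not reproduce the paper's proofs or its diversity metric; the function name `empirical_entropy` and the toy samples are assumptions.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over samples.

    Samples must be hashable, e.g. tuples of feature values for tabular rows.
    """
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A generator that keeps repeating the same row versus one that varies.
    repetitive = [("Private", "Bachelors")] * 3 + [("Self-emp", "Masters")]
    varied = [
        ("Private", "Bachelors"),
        ("Self-emp", "Masters"),
        ("Private", "Doctorate"),
        ("State-gov", "HS-grad"),
    ]
    print("repetitive:", empirical_entropy(repetitive))
    print("varied:    ", empirical_entropy(varied))
```

A set dominated by one repeated row scores lower than a set of distinct rows of the same size, which is the sense in which higher entropy corresponds to more diverse generated samples.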