Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Ryumei Nakada; Yichen Xu; Lexin Li; Linjun Zhang

TL;DR

The paper develops a theory of synthetic oversampling and augmentation for addressing data imbalance and spurious correlations, grounding LLM-based data generation in a risk framework that couples the group-specific risks $R^{(g)}(\theta)$ with a balanced risk $R_{bal}(\theta)$. It proves excess-risk and scaling-law results for both oversampling and augmentation, clarifying how synthetic data quality and volume affect minority-group performance and the mitigation of spurious dependence. It also establishes that transformers can act as high-quality data generators under identifiability conditions, with KL divergence bounded by $D_{KL} \lesssim 1/\sqrt{d} + \log d/\sqrt{n}$. Numerical experiments corroborate the theory: LLM-based oversampling, especially when combined with augmentation, outperforms standard oversampling methods, and the error follows the predicted polynomial decay in the augmentation size. The scaling laws connect synthetic data volume and quality to generalization, which informs how to allocate computational resources among seed data, synthetic generation, and augmentation.

Abstract

Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn compromises the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics in synthetic data augmentation, and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of the LLM-based synthetic oversampling and augmentation.
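To make the notions of group-specific risk and balanced risk concrete, the following is a minimal Python sketch, not the paper's method. It assumes binary labels, cross-entropy loss, and a balanced risk defined as the equal-weight average of the group risks; that definition and the function names `group_risks`, `balanced_risk`, and `random_oversample` are illustrative assumptions. The oversampler shown is plain random oversampling (the ROS baseline), standing in for the LLM-based generator, which is not reproduced here.

```python
import numpy as np

def group_risks(probs, labels, groups):
    """Mean binary cross-entropy within each group.

    probs: predicted P(y = 1), labels: 0/1 array, groups: group id per sample.
    """
    eps = 1e-12
    p = np.clip(probs, eps, 1 - eps)
    losses = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return {g: float(losses[groups == g].mean())
            for g in np.unique(groups).tolist()}

def balanced_risk(probs, labels, groups):
    """Equal-weight average of the group risks (an assumed definition)."""
    risks = group_risks(probs, labels, groups)
    return sum(risks.values()) / len(risks)

def random_oversample(X, y, groups, seed=0):
    """Resample each group with replacement up to the largest group size."""
    rng = np.random.default_rng(seed)
    ids, counts = np.unique(groups, return_counts=True)
    target = counts.max()
    keep = []
    for g, c in zip(ids, counts):
        idx = np.flatnonzero(groups == g)
        # Draw the missing samples from the group's own observations.
        extra = rng.choice(idx, size=target - c, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep], groups[keep]

# Hypothetical toy data: four majority samples (group 0), two minority (group 1).
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])
groups = y.copy()
probs = np.array([0.5, 0.5, 0.5, 0.5, 0.25, 0.25])

Xb, yb, gb = random_oversample(X, y, groups)
risks = group_risks(probs, y, groups)
bal = balanced_risk(probs, y, groups)
```

In this toy example the minority group carries the larger loss, so the balanced risk exceeds the pooled average loss, which is dominated by the majority group. Replacing `random_oversample` with a generator that produces new minority samples is where a synthetic approach would plug in.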

Paper Structure (59 sections, 19 theorems, 256 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Synthetic Oversampling and Augmentation
Problem setup
A general algorithm for synthetic data generation
Theory for Synthetic Oversampling and Augmentation
Regularity conditions
Theory for synthetic oversampling
Imbalanced classification.
Spurious correlation.
Theory for synthetic augmentation
Imbalanced classification.
Spurious correlation.
Theory for Transformers as Synthetic Data Generators
Data generating process
Background on transformers
...and 44 more sections

Key Result

Theorem 1

Suppose Assumptions [asm: differentiability] through [asm: differentiability of ell] hold for any $g \in \mathcal{G}$. Then, ... where ...

Figures (4)

Figure 1: Synthetic oversampling for imbalanced classification. Six methods are compared: no oversampling (RAW), random oversampling (ROS), synthetic minority oversampling technique (SMOTE), adaptive synthetic sampling (ADASYN), the proposed synthetic oversampling (LLM oversampling), and the proposed synthetic oversampling with additional synthetic augmentation (LLM oversampling + augmentation). Two cross-entropy losses are reported as the ratio of majority to minority sample sizes, ${n_{\text{maj}}} / {n_{\text{min}}}$, varies.
Figure 2: Synthetic oversampling for spurious correlation, with a setup similar to Figure 1.
Figure 3: Synthetic augmentation for imbalanced classification. Two cross-entropy losses (blue) are reported as the number of additional augmented samples $N$ increases, all on a log scale. Two benchmark lines are also included, one without augmentation (green) and the other utilizing the original data (red).
Figure 4: Synthetic augmentation for spurious correlation, with a setup similar to Figure 3.
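
The polynomial decay of the loss in the augmentation size $N$ appears as a straight line on log-log axes, and its slope gives the decay exponent. The sketch below shows one way to estimate that exponent by least squares. It assumes a pure power law, loss = c * N^(-alpha), with no irreducible floor; the helper `fit_power_law` and the data values are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def fit_power_law(n_aug, loss):
    """Fit log(loss) = log(c) - alpha * log(n_aug) by least squares.

    Returns (alpha, c). Assumes loss > 0 and a pure power law.
    """
    slope, intercept = np.polyfit(np.log(n_aug), np.log(loss), deg=1)
    return float(-slope), float(np.exp(intercept))

# Hypothetical, noise-free data generated from loss = 2 * N^(-0.5).
n_aug = np.array([100, 200, 400, 800, 1600], dtype=float)
loss = 2.0 * n_aug ** -0.5

alpha, c = fit_power_law(n_aug, loss)
```

With measured losses, a floor term would bend the curve at large $N$; in that case the fit is better restricted to the range where the log-log plot is visibly linear.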

Theorems & Definitions (36)

Theorem 1
Corollary 1
Corollary 2
Theorem 2
Corollary 3
Corollary 4
Theorem 3
Proof of Theorem [thm: synthetic data risk]
Proof of Corollary [cor: synthetic data risk imbalanced]
Proof of Corollary [cor: synthetic data risk spurious]
...and 26 more
