
Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

TL;DR

The paper develops a theory of synthetic oversampling and augmentation for data imbalance and spurious correlations. It grounds LLM-based data generation in a risk framework built on group-specific risks $R^{(g)}(\theta)$ and a balanced risk $R_{bal}(\theta)$. It proves excess-risk and scaling-law results for both oversampling and augmentation, clarifying how synthetic data quality and volume affect minority-group performance and the mitigation of spurious dependence. It also establishes that transformers can act as high-quality data generators under identifiability conditions, with KL divergence bounded as $D_{KL} \lesssim 1/\sqrt{d} + \log d/\sqrt{n}$. Numerical experiments corroborate the theory: LLM-based oversampling, especially when combined with augmentation, outperforms standard oversampling methods, and the error follows the predicted polynomial decay in the augmentation size. The scaling laws connect synthetic data volume and quality to generalization, which informs how to allocate computational resources between seed data, synthetic generation, and augmentation.
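
To make the two risk quantities concrete, the following minimal sketch computes a per-group cross-entropy risk and a balanced risk on a toy example. It assumes that $R_{bal}(\theta)$ is the unweighted average of the group risks $R^{(g)}(\theta)$, which is a common convention but is not spelled out in this summary. The function names `group_risks` and `balanced_risk` and all numbers are illustrative, not taken from the paper.

```python
import math
from collections import defaultdict


def group_risks(probs, labels, groups):
    """Mean binary cross-entropy per group.

    probs[i] is the predicted probability that labels[i] == 1.
    """
    losses = defaultdict(list)
    for p, y, g in zip(probs, labels, groups):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        losses[g].append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return {g: sum(v) / len(v) for g, v in losses.items()}


def balanced_risk(risks):
    """Unweighted average of group risks (assumed form of R_bal)."""
    return sum(risks.values()) / len(risks)


# Toy data: four well-predicted majority samples, one poorly predicted minority sample.
probs = [0.9, 0.9, 0.9, 0.9, 0.5]
labels = [1, 1, 1, 1, 1]
groups = ["maj", "maj", "maj", "maj", "min"]

risks = group_risks(probs, labels, groups)
r_bal = balanced_risk(risks)

# Pooled risk weights every sample equally, so the majority group dominates it.
pooled = sum(-math.log(p) for p in probs) / len(probs)

print(risks, r_bal, pooled)
```

In this toy case the balanced risk is larger than the pooled risk, because the poorly fit minority group counts as much as the majority group rather than as one sample in five.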

Abstract

Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would compromise the accuracy, robustness and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics in synthetic data augmentation, and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of the LLM-based synthetic oversampling and augmentation.
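
A scaling law of the kind mentioned above is typically checked by fitting a straight line on a log-log plot. The sketch below does this with ordinary least squares, assuming a pure power law $\text{error} \approx c \, N^{-\alpha}$ in the number of augmented samples $N$. The function `fit_power_law`, the exponent, and the data are hypothetical and chosen only for illustration; they are not the paper's results, and the paper's actual law may contain further terms.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(c) - alpha * log(size).

    Returns the pair (alpha, c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return -slope, math.exp(intercept)


# Made-up errors that follow 2 * N^(-0.5) exactly, for illustration only.
sizes = [100, 200, 400, 800, 1600]
errors = [2.0 * n ** -0.5 for n in sizes]

alpha, c = fit_power_law(sizes, errors)
print(alpha, c)
```

On real measurements the fitted slope is only meaningful over the range where the curve is roughly linear in log-log scale; a flattening tail suggests an irreducible error floor that a pure power law does not capture.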

Paper Structure

This paper contains 59 sections, 19 theorems, 256 equations, 4 figures, 3 tables, and 1 algorithm.

Key Result

Theorem 1

Suppose Assumptions (asm: differentiability) through (asm: differentiability of ell) hold for any $g \in \mathcal{G}$. Then, where

Figures (4)

  • Figure 1: Synthetic oversampling for imbalanced classification. Six methods are compared: no oversampling (RAW), random oversampling (ROS), synthetic minority oversampling technique (SMOTE), adaptive synthetic sampling (ADASYN), the proposed synthetic oversampling (LLM oversampling), and the proposed synthetic oversampling with additional synthetic augmentation (LLM oversampling + augmentation). Two cross-entropy losses are reported as the ratio of majority to minority sample sizes, $n_{\text{maj}} / n_{\text{min}}$, varies.
  • Figure 2: Synthetic oversampling for spurious correlation, with a setup similar to Figure 1.
  • Figure 3: Synthetic augmentation for imbalanced classification. Two cross-entropy losses (blue) are reported as the number of additional augmented samples $N$ increases, all on a log scale. Two benchmark lines are also included: one without augmentation (green) and one using the original data (red).
  • Figure 4: Synthetic augmentation for spurious correlation, with a setup similar to Figure 3.
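
For reference, the simplest baseline in the comparison above, random oversampling (ROS), can be sketched in a few lines: minority rows are resampled with replacement until every group matches the largest one. This is a generic illustration of the technique, not the implementation used in the paper's experiments; `random_oversample` and `group_of` are hypothetical names, and the rows are toy data.

```python
import random
from collections import Counter


def random_oversample(rows, group_of, seed=0):
    """Duplicate rows of smaller groups, sampled with replacement,
    until every group has as many rows as the largest group."""
    rng = random.Random(seed)
    by_group = {}
    for row in rows:
        by_group.setdefault(group_of(row), []).append(row)
    target = max(len(members) for members in by_group.values())
    out = []
    for members in by_group.values():
        out.extend(members)
        # k is 0 for the largest group, so nothing is added there.
        out.extend(rng.choices(members, k=target - len(members)))
    return out


# Toy data: 8 majority rows and 2 minority rows, each row is (group, id).
rows = [("maj", i) for i in range(8)] + [("min", i) for i in range(2)]
balanced = random_oversample(rows, group_of=lambda row: row[0])
counts = Counter(group for group, _ in balanced)
print(counts)
```

ROS only repeats existing minority rows, so it adds no new information about the minority distribution. SMOTE and ADASYN interpolate between neighbors instead, and the LLM-based approach described here generates new samples from a learned model.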

Theorems & Definitions (36)

  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Theorem 2
  • Corollary 3
  • Corollary 4
  • Theorem 3
  • Proof of Theorem 1
  • Proof of Corollary 1
  • Proof of Corollary 2
  • ...and 26 more