Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang
TL;DR
The paper develops a theory of synthetic oversampling and augmentation for data imbalance and spurious correlations, grounding LLM-based data generation in a risk framework built on group-specific risks $R^{(g)}(\theta)$ and a balanced risk $R_{\mathrm{bal}}(\theta)$. It proves excess-risk bounds and scaling laws for both oversampling and augmentation, clarifying how synthetic-data quality and volume affect minority-group performance and the mitigation of spurious dependence. It also establishes that transformers can act as high-quality data generators under identifiability conditions, with KL divergence bounded as $D_{\mathrm{KL}} \lesssim 1/\sqrt{d} + \log d/\sqrt{n}$. Numerical experiments corroborate the theory: LLM-based oversampling, and especially its combination with augmentation, outperforms standard oversampling methods, and the error follows the predicted polynomial decay in the augmentation size. Together, the scaling laws connect synthetic data volume and quality to generalization, which informs how to allocate computational resources among seed data, synthetic generation, and augmentation in practical imbalanced-learning tasks.
Abstract
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, in which certain groups of data samples are significantly underrepresented, which in turn compromises the accuracy, robustness, and generalizability of the learned models. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and to augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics in synthetic data augmentation and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of LLM-based synthetic oversampling and augmentation.
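To make the idea of synthetic oversampling concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the balanced risk is the unweighted average of the per-group mean losses, which is one common convention and may differ from the paper's exact definition. The function names (`group_risks`, `balanced_risk`, `oversample_to_balance`, `gaussian_generator`) are hypothetical, and the diagonal-Gaussian generator is only a toy stand-in for an LLM-based generator.

```python
import numpy as np

def group_risks(losses, groups):
    """Mean loss within each group: an empirical stand-in for R^(g)(theta)."""
    return {int(g): float(losses[groups == g].mean()) for g in np.unique(groups)}

def balanced_risk(losses, groups):
    """Unweighted average of the group risks (assumed form of R_bal)."""
    risks = group_risks(losses, groups)
    return sum(risks.values()) / len(risks)

def gaussian_generator(seed_rows, n_new, rng):
    """Toy generator (assumption): sample from a diagonal Gaussian fitted to the seed rows."""
    mean = seed_rows.mean(axis=0)
    std = seed_rows.std(axis=0) + 1e-8  # avoid a zero scale for constant columns
    return rng.normal(mean, std, size=(n_new, seed_rows.shape[1]))

def oversample_to_balance(X, groups, generate, rng):
    """Append synthetic rows so every group matches the largest group's size."""
    counts = {int(g): int((groups == g).sum()) for g in np.unique(groups)}
    target = max(counts.values())
    X_parts, g_parts = [X], [groups]
    for g, n_g in counts.items():
        deficit = target - n_g
        if deficit > 0:
            # Use the observed rows of the minority group as seed data.
            X_parts.append(generate(X[groups == g], deficit, rng))
            g_parts.append(np.full(deficit, g, dtype=groups.dtype))
    return np.concatenate(X_parts), np.concatenate(g_parts)

# Usage: 90 majority rows and 10 minority rows, balanced by synthetic oversampling.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (90, 2)), rng.normal(3.0, 1.0, (10, 2))])
groups = np.array([0] * 90 + [1] * 10)
X_bal, g_bal = oversample_to_balance(X, groups, gaussian_generator, rng)
print(np.bincount(g_bal))  # each group now has 90 rows
```

Passing the generator as a function keeps the balancing logic separate from the choice of generator, so a higher-quality generator can be swapped in without changing how the groups are equalized or how the risks are evaluated.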
