LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification
Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang, Dongsheng Li
TL;DR
This work tackles private medical text classification under limited public data by reframing DP-based synthetic data generation as a DP-based discrimination task. It introduces a three-component DP framework: an LLM-based public generator, a DP-based discriminator trained via knowledge distillation from multiple private-data teachers, and a DP-based label distribution tutor to steer generated samples toward the private data distribution with low privacy cost. The combination provides provable DP guarantees and empirically improves downstream classification accuracy on a medical transcription dataset, outperforming strong DP baselines and even the non-private upper bound in some settings due to higher quality and diversity of generated samples. The approach offers a practical path to privacy-preserving data augmentation in sensitive domains, enabling scalable private-domain text classification with controlled privacy leakage and improved utility.
Abstract
Because sufficient data are not always publicly accessible for model training, researchers either exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in a private domain requires privacy protection approaches (i.e., anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not well suited to generating pseudo text samples with large models. In this paper, we recast the DP-based pseudo-sample generation task as a DP-based task of discriminating generated samples, and we propose a DP-based DA method with an LLM and a DP-based discriminator for text classification in private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, which access private data, teach students how to select private samples, with calibrated noise added to achieve DP. To constrain the distribution of the generated samples, we propose a DP-based tutor that models the noised private distribution and controls sample generation at a low privacy cost. We theoretically analyze our model's privacy protection and empirically verify the model.
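The two DP components described in the abstract can be illustrated with a minimal sketch. The abstract does not state which noise mechanism is used, so the Laplace mechanism here is an assumption, as are all function names and parameters (`noisy_label_distribution`, `noisy_teacher_vote`, `epsilon`, `noise_scale`). The first function mimics the tutor by releasing a noised label histogram of the private data; the second mimics the discriminator's teacher aggregation by adding noise to binary teacher votes before taking the majority. Privacy accounting across repeated queries is omitted.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_label_distribution(labels, classes, epsilon, rng):
    """Tutor sketch (assumed mechanism): noised label histogram of private data.

    Under add/remove-one-record adjacency a histogram has L1 sensitivity 1,
    so Laplace noise with scale 1/epsilon is used. Clipping and normalizing
    afterwards are post-processing steps.
    """
    counts = Counter(labels)
    noisy = [
        max(counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng), 0.0)
        for c in classes
    ]
    total = sum(noisy)
    if total == 0.0:
        # Every noisy count was clipped to zero: fall back to uniform.
        return {c: 1.0 / len(classes) for c in classes}
    return {c: v / total for c, v in zip(classes, noisy)}


def noisy_teacher_vote(votes, noise_scale, rng):
    """Discriminator sketch (assumed mechanism): noisy majority of teacher votes.

    votes holds one 0/1 entry per teacher, where 1 means the teacher judges
    the generated sample to resemble its private data partition.
    """
    yes = sum(votes) + laplace_noise(noise_scale, rng)
    no = (len(votes) - sum(votes)) + laplace_noise(noise_scale, rng)
    return 1 if yes > no else 0


if __name__ == "__main__":
    rng = random.Random(0)
    private_labels = ["cardiology"] * 80 + ["neurology"] * 20  # toy data
    dist = noisy_label_distribution(
        private_labels, ["cardiology", "neurology"], epsilon=1.0, rng=rng
    )
    print(dist)
    print(noisy_teacher_vote([1, 1, 1, 0, 1], noise_scale=1.0, rng=rng))
```

In this sketch the noised label distribution could be used to decide how many samples per class to request from the generator, while the noisy vote decides whether a generated sample is kept; both uses are illustrative readings of the abstract rather than the authors' exact procedure.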
