LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification

Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang, Dongsheng Li

TL;DR

This work tackles private medical text classification under limited public data by reframing DP-based synthetic data generation as a DP-based discrimination task. It introduces a three-component DP framework: an LLM-based public generator, a DP-based discriminator trained via knowledge distillation from multiple private-data teachers, and a DP-based label distribution tutor to steer generated samples toward the private data distribution with low privacy cost. The combination provides provable DP guarantees and empirically improves downstream classification accuracy on a medical transcription dataset, outperforming strong DP baselines and even the non-private upper bound in some settings due to higher quality and diversity of generated samples. The approach offers a practical path to privacy-preserving data augmentation in sensitive domains, enabling scalable private-domain text classification with controlled privacy leakage and improved utility.

Abstract

As sufficient data are not always publicly accessible for model training, researchers exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in private domains requires privacy protection approaches (i.e., anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not well suited to generating pseudo text samples with large models. In this paper, we recast the DP-based pseudo-sample generation task as a DP-based generated-sample discrimination task, and propose a DP-based DA method with an LLM and a DP-based discriminator for text classification in private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, which access private data, teach students how to select private samples, with calibrated noise added to achieve DP. To constrain the distribution of the generated DA samples, we propose a DP-based tutor that models the noised private distribution and controls sample generation at a low privacy cost. We theoretically analyze our model's privacy protection and empirically verify our model.
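
The abstract's idea of teachers passing knowledge to a student through calibrated noise can be illustrated with a minimal sketch. It assumes a PATE-style noisy-vote aggregation with Laplace noise, which is a common way to realize this idea and not necessarily the authors' exact mechanism. The function names (`laplace_noise`, `noisy_vote`) and the binary "private-like or not" labels are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_vote(teacher_votes, num_classes, epsilon, rng):
    """Return the class whose Laplace-noised vote count is largest.

    Assumption: each teacher is trained on a disjoint shard of the private
    data, so changing one private record changes at most one vote. That moves
    two counts by one each (L1 sensitivity 2), hence the scale 2 / epsilon.
    """
    counts = Counter(teacher_votes)
    scale = 2.0 / epsilon
    noisy = [counts.get(c, 0) + laplace_noise(scale, rng) for c in range(num_classes)]
    return max(range(num_classes), key=noisy.__getitem__)


# Ten hypothetical teachers judge whether one generated sample looks
# private-like (1) or not (0); the student only ever sees the noisy label.
rng = random.Random(0)
votes = [1] * 9 + [0]
student_label = noisy_vote(votes, num_classes=2, epsilon=100.0, rng=rng)
```

A smaller `epsilon` injects more noise, so the label handed to the student becomes less reliable as the privacy guarantee gets stronger.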

Paper Structure (41 sections, 13 equations, 5 figures, 2 tables)

This paper contains 41 sections, 13 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of our framework. It contains three main components. The LLM-based Public Generator (gray block) generates public data. The DP-based Discriminator with Knowledge Distillation (pink block) scores each public sample with the probability that it resembles private data. The Label Distribution Tutor (blue block) selects the subset of highest-probability samples that matches the noised label distribution. The gray dotted box marks the privacy block.
  • Figure 2: The privacy-utility tradeoff in accuracy across three DA w/ DP methods at varying $\varepsilon$. The vertical axis represents the accuracy on downstream text classification tasks.
  • Figure 3: The privacy-utility tradeoff in the discriminator's accuracy across three DA w/ DP methods at varying $\varepsilon$. The vertical axis represents the accuracy on our constructed test set.
  • Figure 4: Analysis of the number of teachers. The vertical axis represents the prediction accuracy of the discriminator on our constructed test set.
  • Figure 5: Label distributions. The horizontal axis enumerates all data labels, while the vertical axis represents the frequency of the labels.
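
The tutor's selection step described in the Figure 1 caption can be sketched as follows. This is an illustrative reading, not the paper's algorithm: it assumes the private label histogram is released with Laplace noise (add-or-remove-one neighbouring datasets, so each count has sensitivity 1) and that each label receives a quota proportional to its noised share. All names (`noised_label_distribution`, `select_by_distribution`, `budget`) are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noised_label_distribution(private_labels, label_set, epsilon, rng):
    """Release a normalized label histogram with Laplace(1 / epsilon) noise per count."""
    counts = Counter(private_labels)
    noisy = {
        y: max(0.0, counts.get(y, 0) + laplace_noise(1.0 / epsilon, rng))
        for y in label_set
    }
    total = sum(noisy.values())
    if total == 0:
        # Degenerate case: fall back to a uniform distribution.
        return {y: 1.0 / len(label_set) for y in label_set}
    return {y: v / total for y, v in noisy.items()}


def select_by_distribution(candidates, target_dist, budget):
    """Keep the top-scored candidates per label, with quotas set by target_dist.

    candidates: list of (text, label, score) tuples, where score is the
    discriminator's probability that the sample resembles private data.
    """
    selected = []
    for label, share in target_dist.items():
        quota = round(share * budget)
        pool = sorted(
            (c for c in candidates if c[1] == label),
            key=lambda c: c[2],
            reverse=True,
        )
        selected.extend(pool[:quota])
    return selected


rng = random.Random(0)
dist = noised_label_distribution(["a", "a", "b"], ["a", "b"], epsilon=1.0, rng=rng)
picked = select_by_distribution(
    [("x", "a", 0.9), ("y", "a", 0.2), ("z", "b", 0.4), ("w", "b", 0.7)],
    {"a": 0.5, "b": 0.5},
    budget=2,
)
```

Because only the label histogram is queried, the privacy cost of this step is a single noisy release, which is consistent with the low privacy cost the abstract attributes to the tutor.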