KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

Wenhao Wang, Xiaoyu Liang, Rui Ye, Jingyi Chai, Siheng Chen, Yanfeng Wang

TL;DR

This work proposes KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. Inspired by federated learning, it transmits models rather than data between the client and server to prevent privacy leakage.

Abstract

The success of large language models (LLMs) has enabled many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns because LLMs can memorize their training data. Existing solutions, such as substituting synthetic data for the private data, struggle to improve performance and preserve privacy at the same time. They either rely on a local model for generation, which degrades performance, or call external APIs, which directly exposes the data to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is publicly available at https://github.com/wwh0411/KnowledgeSG.
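The abstract mentions learning from private data with differential privacy but does not say which mechanism is used. As an illustration only, the sketch below assumes the common DP-SGD recipe (clip each example's gradient, sum, add Gaussian noise, average). The function names `clip_gradient` and `dp_sgd_step` and all parameters are hypothetical and are not taken from the KnowledgeSG code base.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD-style update (assumed mechanism, not confirmed by the paper).

    Steps: clip each example's gradient, sum the clipped gradients,
    add Gaussian noise with std = noise_multiplier * max_norm,
    then average over the batch and take a gradient step.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.0, 0.0]
    grads = [[3.0, 4.0], [0.0, 1.0]]  # toy per-example gradients
    print(dp_sgd_step(weights, grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any single private record can move the model, and the noise scale is tied to that bound. In practice one would likely rely on an established DP training library and a privacy accountant rather than a hand-written loop like this.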

Paper Structure (78 sections, 10 figures, 14 tables)

Figures (10)

  • Figure 1: The dilemma of current synthetic data methods. API-based methods involve more privacy risks while methods based on local models face performance degradation due to lower synthetic data quality.
  • Figure 2: Overview of KnowledgeSG's system architecture. $\mathbb{W}_{Loc}$: the local base model; $\mathbb{W}_{DP}$: DP-finetuned $\mathbb{W}_{Loc}$; $\mathbb{W}_{Target}$: the final target model; $\mathbb{W}_{Pro}$: the professional model. From left to right, $\mathbb{W}_{Loc}$ learns from private data on the client side and then receives knowledge distilled from $\mathbb{W}_{Pro}$ on the server side.
  • Figure 3: Instruction-following difficulty (IFD) of different baselines using Llama2-7B as the base model. A lower IFD score indicates better synthetic data quality. We evaluate on the synthetic datasets generated during the experiments in Section \ref{sec:medical_freeform}.
  • Figure 4: Examples of individual names contained in the ICliniq dataset \cite{li2023chatdoctor}. Individual names, as one form of personally identifiable information (PII), can be used to identify the corresponding individuals. For anonymity, we substitute the original names with synthetic ones, as described in Appendix \ref{sec:name_substitution}.
  • Figure 5: Illustration of our identified gap between model comprehension and data complexity. We make an analogy by describing a situation where a student is asked to create a new question based on given examples.
  • ...and 5 more figures
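
Figure 3 above reports instruction-following difficulty (IFD) scores without defining them. The sketch below assumes the definition commonly used in the IFD literature: the mean negative log-likelihood of the response given the instruction, divided by the mean negative log-likelihood of the response alone. The paper's exact computation may differ. The helper names are hypothetical, and the per-token log-probabilities are assumed to come from some language model that is not shown here.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood over the response tokens."""
    if not token_logprobs:
        raise ValueError("need at least one response token")
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(logprobs_with_instruction, logprobs_without_instruction):
    """IFD under the assumed definition: conditioned loss / unconditioned loss.

    Both arguments are per-token log-probabilities of the same response,
    scored once with the instruction in context and once without it.
    """
    return mean_nll(logprobs_with_instruction) / mean_nll(logprobs_without_instruction)


if __name__ == "__main__":
    # Toy numbers: the instruction halves the per-token loss of the response.
    with_instruction = [-0.5, -0.5]
    without_instruction = [-1.0, -1.0]
    print(ifd_score(with_instruction, without_instruction))
```

Under this assumed definition, a score well below 1 means the instruction makes the response much easier for the model to predict.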