Privacy-Preserving Instructions for Aligning Large Language Models

Da Yu; Peter Kairouz; Sewoong Oh; Zheng Xu

TL;DR

This work tackles the privacy risks of aligning LLMs with user instructions by replacing real instructions with differentially private (DP) synthetic ones produced by privately fine-tuned generators. It introduces a two-stage framework: (1) privately fine-tune a generator and use it to synthesize a large set of instructions, and (2) privately resample (filter) these instructions by matching a histogram of the real instructions in embedding space, for an end-to-end privacy budget of $\varepsilon \approx 5.98$. Empirical results show that DP synthetic instructions achieve utility comparable, and in some cases superior, to that of real instructions in supervised fine-tuning and RLHF, with notable gains from the filtering step and from larger pre-trained models. The approach makes privacy-preserving instruction alignment practical, reducing memorization risks while preserving downstream LLM performance.
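The resampling stage can be illustrated with a minimal sketch. It assumes that real and synthetic instructions have already been embedded, that cluster centroids are given, and that each real instruction casts one vote for its nearest centroid before Gaussian noise with standard deviation `sigma` is added to the vote histogram. The function name `dp_histogram_resample`, the clipping of negative noisy counts, and sampling with replacement are illustrative choices, not the paper's exact algorithm, and calibrating `sigma` to a target privacy budget is not shown.

```python
import numpy as np


def dp_histogram_resample(real_emb, syn_emb, centroids, sigma, n_out, rng):
    """Return indices of synthetic items, resampled so that cluster
    frequencies follow a noisy histogram of real-data votes.

    Illustrative sketch only; names and details are assumptions.
    """

    def nearest(x):
        # Squared Euclidean distance from every row of x to every centroid.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)

    k = len(centroids)

    # Each real sample contributes exactly one vote to one cluster.
    votes = np.bincount(nearest(real_emb), minlength=k).astype(float)

    # Gaussian noise on the histogram is the only step that reads real data.
    noisy = np.clip(votes + rng.normal(0.0, sigma, size=k), 0.0, None)
    if noisy.sum() > 0:
        target = noisy / noisy.sum()
    else:
        target = np.full(k, 1.0 / k)

    # Spread each cluster's target mass evenly over its synthetic members.
    syn_cluster = nearest(syn_emb)
    syn_counts = np.bincount(syn_cluster, minlength=k)
    weights = target[syn_cluster] / syn_counts[syn_cluster]
    total = weights.sum()
    if total > 0:
        weights = weights / total
    else:
        # All target mass fell on clusters with no synthetic items.
        weights = np.full(len(syn_emb), 1.0 / len(syn_emb))

    return rng.choice(len(syn_emb), size=n_out, replace=True, p=weights)


# Toy usage with random 2-D "embeddings" (hypothetical data).
rng = np.random.default_rng(0)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
real = np.vstack([rng.normal(0, 0.1, (90, 2)), rng.normal(10, 0.1, (10, 2))])
syn = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(10, 0.1, (50, 2))])
keep = dp_histogram_resample(real, syn, centroids, sigma=1.0, n_out=200, rng=rng)
print("share of kept items from cluster 0:", float((keep < 50).mean()))
```

In this toy setup the synthetic pool is split evenly between two clusters while the real data is skewed toward the first, so the resampled set should shift toward the first cluster. Because the synthetic pool is itself the output of a private generator, only the noisy histogram consumes additional privacy budget in this sketch.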

Abstract

Service providers of large language model (LLM) applications collect user instructions in the wild and use them in further aligning LLMs with users' intentions. These instructions, which potentially contain sensitive information, are annotated by human workers in the process. This poses a new privacy risk not addressed by the typical private optimization. To this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. Formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. Crucial in achieving the desired utility is our novel filtering algorithm that matches the distribution of the synthetic instructions to that of the real ones. In both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing comparable results to real instructions. In supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as Vicuna.

Paper Structure (30 sections, 1 equation, 18 figures, 10 tables, 2 algorithms)

Introduction
Privacy Risks and Background
Generating Synthetic Instructions with Differential Privacy
Stage 1: DP Instruction Generator
Stage 2: Resample with DP Histogram
Experiments
Setup for Generating Synthetic Instructions
Measuring the Distributional Gap
Supervised Fine-tuning
RLHF with Proximal Policy Optimization
Conclusion
Additional Related Work
Real-world User Instructions Are Sensitive
Empirical Privacy Leakage
Ablation Studies
...and 15 more sections

Figures (18)

Figure 1: Privacy concerns of collecting and training with user instructions in LLM applications.
Figure 2: Samples of instructions containing personal information. We mask the sensitive texts to protect user privacy.
Figure 3: Our two-stage framework for privately generating high-quality synthetic instructions.
Figure 4: Probability densities of clusters of synthetic instructions. The black line shows the sorted votes from real samples (before noising). The filtering process aligns the distribution of synthetic instructions with that of real instructions.
Figure 5: Running the DP filtering algorithm with different $K$ and $\sigma$. The MAUVE scores initially improve as $K$ increases, then either plateau or decline.
...and 13 more figures

Theorems & Definitions (1)

Definition 2.1: $(\varepsilon,\delta)$-Differential Privacy
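
For reference, the standard statement of this definition is given here. It is assumed to match the paper's Definition 2.1 up to notation; the symbols $M$, $D$, $D'$, and $S$ are chosen for this summary. A randomized mechanism $M$ satisfies $(\varepsilon,\delta)$-differential privacy if, for all neighboring datasets $D$ and $D'$ (differing in one record) and every set of outputs $S$,

$$\Pr[M(D) \in S] \le e^{\varepsilon}\,\Pr[M(D') \in S] + \delta.$$

Smaller values of $\varepsilon$ and $\delta$ correspond to a stronger guarantee, since the output distribution then depends less on any single record.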
