PROPS: Progressively Private Self-alignment of Large Language Models

Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon

TL;DR

The paper addresses privacy concerns in human-preference data used to align LLMs. It introduces PROPS, a progressively private self-alignment framework that partitions data, perturbs first-stage labels with randomized response, and uses an intermediate model to generate refined private labels via maximum likelihood estimation in a second stage. Theoretical guarantees on privacy and sub-optimality are provided, and extensive experiments show PROPS substantially improves the privacy-utility trade-off over RR and DP-SGD, especially at high privacy levels, across multiple datasets and model families. The work offers a robust privacy-preserving approach to LLM alignment and suggests extensions to RLHF settings and broader applications.

Abstract

Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler's preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment. However, because human preferences are tied only to the labels of (prompt, response) pairs, these approaches can provide more privacy than necessary and can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy-preserving alignment framework where privately aligned models in previous stages can serve as labelers for supplementing training data in the subsequent stages of alignment. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing high privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win-rates compared to DP-SGD, and 2.5x higher win-rates compared to Randomized Response (RR) based alignment.
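
As a rough illustration of the Randomized Response (RR) baseline mentioned above, the following sketch privatizes a binary preference label. It assumes the standard binary RR mechanism, in which a label is flipped with probability $\gamma_{\epsilon} = 1/(1+e^{\epsilon})$; the function names `flip_probability` and `randomized_response` are hypothetical and are not taken from the paper.

```python
import math
import random


def flip_probability(epsilon: float) -> float:
    """Flip probability of standard binary RR (assumed): 1 / (1 + e^epsilon)."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomized_response(label: int, epsilon: float, rng: random.Random) -> int:
    """Return the binary label flipped with probability gamma, else unchanged."""
    gamma = flip_probability(epsilon)
    return 1 - label if rng.random() < gamma else label


# Illustrative usage: privatize 10,000 identical labels and measure the flip rate.
rng = random.Random(0)
true_labels = [1] * 10_000
noisy_labels = [randomized_response(l, 1.0, rng) for l in true_labels]
flip_rate = sum(t != n for t, n in zip(true_labels, noisy_labels)) / len(noisy_labels)
```

With $\epsilon = 1$ the empirical flip rate should sit close to $1/(1+e) \approx 0.27$. Smaller $\epsilon$ pushes the flip probability toward $0.5$, which is the high-privacy regime where label noise is most severe.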

Paper Structure

This paper contains 17 sections, 2 theorems, 20 equations, 9 figures, 9 tables, 1 algorithm.

Key Result

Lemma 1

For all $\delta'\ge 0$, the PROPS framework satisfies $(\epsilon,0)$-Preference DP. If no labeler labels more than $k$ prompt-response pairs in dataset $\mathcal{D}$, then PROPS satisfies $(\epsilon_{\text{Labeler}},\delta_{\text{Labeler}})$ Labeler DP, where $\epsilon_{\text{Labeler}}=k\epsilon^2+\epsilo

Figures (9)

  • Figure 1: (a) Randomized Response (RR) based alignment, where human preferences in the dataset $\mathcal{D}$ are privatized using RR and then used for alignment. (b) DP-SGD based alignment, where differentially private gradients are used for model alignment. (c) Two-stage PROPS framework: dataset $\mathcal{D}$ is partitioned into disjoint subsets $(\mathcal{D}_1, \mathcal{D}_2)$. In Stage $1$, preferences in $\mathcal{D}_1$ are privatized using RR and used for alignment, resulting in an intermediate aligned model $M_1$. In Stage $2$, model $M_1$ is used to independently rank the responses in $\mathcal{D}_2$. Private labels for $\mathcal{D}_2$ are then obtained by combining the model's predictions with RR labels via a maximum likelihood estimator (MLE). These progressively refined private preferences are then used for alignment to arrive at the final model $M_2$.
  • Figure 2: (a) Win-Tie rate evaluation of PROPS vs RR and PROPS vs DP-SGD-aligned models on the truthy-dpo-v0.1 dataset for GPT2-Large and GPT2-Medium models, demonstrating the advantages of preference-level privacy with PROPS, particularly in high-privacy regimes. (b) Prompt-Response pairs generated by GPT2-Large model with PROPS, DP-SGD and RR-based alignment for different privacy regimes.
  • Figure 3: Key building blocks of the PROPS framework. The figure illustrates label generation in PROPS: in the first round, the human-annotated labels $\ell^*$ are perturbed using RR to give $\ell_{\text{RR}}$, which are then used to align model $M_1$. In every $(k+1)^{\text{th}}$ round, an MLE selects between the labels predicted by model $M_{k}$ ($\ell_{M_k}$) and the RR-based labels ($\ell_{\text{RR}}$) to produce the labels $\ell_{\text{PROPS}}$.
  • Figure 4: Prompt-Response pairs generated by PROPS and DP-SGD based GPT2-Large models and their corresponding scores (helpfulness and harmlessness). The example shows that as the privacy constraints become less strict, the quality of responses gradually improves. More prompt-response examples are in Appendix A.8.
  • Figure 5: The table represents the probability of observing $\ell_{RR}$ and $\ell_M$ based on the flipping probabilities $\gamma_{\epsilon}$ and $\gamma_{M_1}$ and true label $\ell^*$.
  • ...and 4 more figures
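
The label-combination step described in the captions for Figures 1, 3 and 5 can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes that, given the true label $\ell^*$, the RR label and the model label are flipped independently with probabilities $\gamma_{\epsilon}$ and $\gamma_{M_1}$, and that $\gamma_{M_1}$ is known or has been estimated. The function names and the tie-breaking rule (fall back to the RR label) are hypothetical.

```python
def likelihood(observed: int, true_label: int, gamma: float) -> float:
    """P(observed | true_label) for a binary label flipped with probability gamma."""
    return 1.0 - gamma if observed == true_label else gamma


def joint_probability(l_rr: int, l_m: int, true_label: int,
                      gamma_eps: float, gamma_m: float) -> float:
    """P(l_rr, l_m | true_label), assuming the two flips are independent."""
    return (likelihood(l_rr, true_label, gamma_eps)
            * likelihood(l_m, true_label, gamma_m))


def mle_label(l_rr: int, l_m: int, gamma_eps: float, gamma_m: float) -> int:
    """Pick the true-label value that maximizes the joint likelihood."""
    p0 = joint_probability(l_rr, l_m, 0, gamma_eps, gamma_m)
    p1 = joint_probability(l_rr, l_m, 1, gamma_eps, gamma_m)
    if p1 > p0:
        return 1
    if p0 > p1:
        return 0
    return l_rr  # Tie-breaking toward the RR label is an assumption of this sketch.


# Illustrative usage: the RR label says 1, the model says 0.
# The model is assumed more reliable here (0.2 < 0.3), so its label wins.
combined = mle_label(l_rr=1, l_m=0, gamma_eps=0.3, gamma_m=0.2)
```

Under these assumptions, when the two labels agree the estimator returns the shared label (as long as both flip probabilities are below $0.5$), and when they disagree it returns the label from the source with the smaller flip probability.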

Theorems & Definitions (4)

  • Definition 1: $(\epsilon, \delta)$ Differential Privacy
  • Definition 2: $(\epsilon,\delta)$-Preference level DP
  • Lemma 1
  • Theorem 1