Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling

Young Hyun Cho, Will Wei Sun

Abstract

Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.

Paper Structure (49 sections, 14 theorems, 213 equations, 8 figures, 4 tables, 1 algorithm)

This paper contains 49 sections, 14 theorems, 213 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Lemma 2

For a fixed context $x$, reward $r$, and reference policy $\pi_0$, any maximizer $\pi_r^\eta\in\arg\max_\pi V_\eta(\pi;r)$ satisfies $$\pi_r^\eta(a\mid x)=\frac{\pi_0(a\mid x)\exp\!\left(\eta r(x,a)\right)}{Z_r(x)},$$ where $Z_r(x)=\mathbb{E}_{a \sim \pi_0(\cdot\mid x)}\!\left[\exp\!\left(\eta r(x,a)\right)\right]$.
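
A small numerical sketch can make Lemma 2 concrete. It assumes a finite action set and that $V_\eta(\pi;r)$ is the KL-regularized objective $\mathbb{E}_{a\sim\pi}[r(x,a)]-\tfrac{1}{\eta}\mathrm{KL}(\pi\,\|\,\pi_0)$, which this summary does not state explicitly. Under that assumption, the maximizer is the reference policy tilted by $\exp(\eta r)$ and normalized by $Z_r(x)$. The function names and toy numbers below are illustrative and do not come from the paper.

```python
import numpy as np


def gibbs_policy(pi0, rewards, eta):
    """Return pi(a) proportional to pi0(a) * exp(eta * r(a)) for one fixed context."""
    pi0 = np.asarray(pi0, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the max reward for numerical stability; the shift cancels after normalizing.
    weights = pi0 * np.exp(eta * (rewards - rewards.max()))
    return weights / weights.sum()


def regularized_value(pi, pi0, rewards, eta):
    """Assumed objective (not stated in the source): E_pi[r] - (1/eta) * KL(pi || pi0)."""
    pi = np.asarray(pi, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    mask = pi > 0  # terms with pi(a) = 0 contribute nothing to the KL divergence
    kl = np.sum(pi[mask] * np.log(pi[mask] / pi0[mask]))
    return float(pi @ rewards - kl / eta)


# Toy problem: three actions in a single context (hypothetical numbers).
pi0 = np.array([0.5, 0.3, 0.2])
rewards = np.array([0.0, 1.0, 2.0])
eta = 1.5

pi_star = gibbs_policy(pi0, rewards, eta)
v_star = regularized_value(pi_star, pi0, rewards, eta)

# Z_r(x) = E_{a ~ pi0}[exp(eta * r(x, a))]
log_Z = float(np.log(np.sum(pi0 * np.exp(eta * rewards))))

# Compare against randomly drawn alternative policies on the same actions.
rng = np.random.default_rng(0)
others = rng.dirichlet(np.ones(3), size=200)
best_other = max(regularized_value(p, pi0, rewards, eta) for p in others)

print("tilted policy:", pi_star)
print("value at tilted policy:", v_star)
print("log Z / eta:", log_Z / eta)
print("best value among random policies:", best_other)
```

Under the assumed objective, the value at the tilted policy equals $\tfrac{1}{\eta}\log Z_r(x)$, so the second and third printed numbers should agree, and no randomly drawn policy should exceed them.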

Figures (8)

  • Figure 1: A typical large language model adaptation pipeline. We focus on privacy during the preference fine-tuning stage, where sensitive user interactions can be directly reflected in training records.
  • Figure 2: A sensitive interaction record. Even without direct identifiers, prompts can contain quasi-identifiers whose combination may re-identify an individual, motivating tuple-level protection of $(x_i,a_i^1,a_i^2,y_i)$.
  • Figure 3: Phase diagram of the statistical and privacy errors on a log-log scale. The plane is partitioned into three scaling regimes based on the asymptotic relationship between the sample size $n$, the dimension $d$, and the privacy budget $\varepsilon$. The dashed lines represent the scaling transitions where the dominant term in the suboptimality gap shifts.
  • Figure 4: Convergence of Suboptimality Gap. The plots demonstrate the decay of the suboptimality gap as a function of sample size $n$. (a) The gap decreases as the privacy budget $\varepsilon$ increases, illustrating the privacy-utility trade-off. (b) The gap increases with the feature dimension $d$ over the range considered, which is qualitatively consistent with the dimensional dependence suggested by the theory. Shaded regions indicate 95% confidence intervals over 30 trials.
  • Figure 5: Synthetic $\eta$-sweep at $(\varepsilon,\delta)=(1,10^{-5})$ (fixed $d=7$). Top row: suboptimality gap $V_\eta(\pi_\eta^\star)-V_\eta(\hat{\pi})$. Bottom row: normalized gap. Baselines use $C=2L(d)$. Shaded regions indicate 95% confidence intervals over 30 trials.
  • ...and 3 more figures

Theorems & Definitions (23)

  • Definition 1: Bradley-Terry Model
  • Lemma 2: Policy Improvement Oracle
  • Definition 3: $(\varepsilon,\delta)$-Differential Privacy (Dwork et al., 2006)
  • Example 4: DPO
  • Example 5: PPO-style policy optimization
  • Proposition 6: Privacy of the Framework
  • proof
  • Lemma 7: Utility of private projected SGD
  • Remark 8: Unconditional DP-SGD control
  • Theorem 9: Upper bound on the suboptimality gap
  • ...and 13 more