Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling

Young Hyun Cho; Will Wei Sun

Abstract

Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
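The decoupled design described in the abstract can be illustrated with a minimal sketch of the private stage alone. It is not the authors' implementation: it assumes a linear reward $r_\theta(x,a)=\theta^\top\phi(x,a)$ fitted under the Bradley-Terry model by noisy projected SGD with per-example gradient clipping. The names `private_reward_sgd`, `clip_norm`, `sigma`, and `radius` are hypothetical, and `sigma` is a raw noise multiplier that is not calibrated to any $(\varepsilon,\delta)$ here; a privacy accountant would be needed for that.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clip(vec, max_norm):
    # Rescale vec so that its Euclidean norm is at most max_norm.
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def private_reward_sgd(data, dim, steps, batch_size, lr,
                       clip_norm, sigma, radius, rng):
    """Noisy projected SGD on the Bradley-Terry negative log-likelihood.

    data:   list of (z, y), with z = phi(x, a1) - phi(x, a2) and
            y = 1 if a1 was preferred, else 0 (assumed encoding).
    sigma:  hypothetical noise multiplier; NOT calibrated to (eps, delta).
    radius: radius of the Euclidean ball that theta is projected onto.
    """
    theta = [0.0] * dim
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        grad_sum = [0.0] * dim
        for z, y in batch:
            # Bradley-Terry: P(y = 1) = sigmoid(theta . z).
            p = sigmoid(sum(t * zi for t, zi in zip(theta, z)))
            # Per-example gradient of the log-loss, clipped to clip_norm.
            g = clip([(p - y) * zi for zi in z], clip_norm)
            grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
        # Gaussian noise scaled to the clipping norm, then averaged.
        noisy = [(s + rng.gauss(0.0, sigma * clip_norm)) / batch_size
                 for s in grad_sum]
        theta = [t - lr * g for t, g in zip(theta, noisy)]
        theta = clip(theta, radius)  # projection onto the ball
    return theta
```

Only this stage reads the preference tuples. Any policy computed from the returned `theta` without revisiting the data inherits the same guarantee by the post-processing property of differential privacy, which is the structural point of restricting privacy to reward learning.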

Paper Structure

This paper contains 49 sections, 14 theorems, 213 equations, 8 figures, 4 tables, 1 algorithm.

Introduction
Our Contributions
Related Work
Paper organization and notation
Preliminaries
Reinforcement Learning from Human Feedback
Differential Privacy
Proposed Method
Motivation: Challenges in Private Policy Optimization
Proposed Framework: Private Reward-Based Alignment
Theoretical Analysis
Upper Bound on the Suboptimality Gap
Minimax Lower Bound on the Suboptimality Gap
Rate-optimality
Numerical Studies
...and 34 more sections

Key Result

Lemma 2

For a fixed context $x$, reward $r$, and reference policy $\pi_0$, any maximizer $\pi_r^\eta\in\arg\max_\pi V_\eta(\pi;r)$ satisfies

$$\pi_r^\eta(a\mid x)=\frac{\pi_0(a\mid x)\exp\!\left(\eta r(x,a)\right)}{Z_r(x)},$$

where $Z_r(x)=\mathbb{E}_{a \sim \pi_0(\cdot\mid x)}\!\left[\exp\!\left(\eta r(x,a)\right)\right]$.
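For a finite action set, the maximizer in Lemma 2 can be computed directly. The sketch below assumes the usual exponentially tilted (Gibbs) form $\pi(a\mid x)\propto\pi_0(a\mid x)\exp(\eta r(x,a))$, which is what the normalizer $Z_r(x)$ implies, and it assumes that $V_\eta(\pi;r)=\mathbb{E}_{a\sim\pi}[r(x,a)]-\eta^{-1}\mathrm{KL}(\pi\,\|\,\pi_0)$. That objective is not stated in this excerpt, so treat it as an assumption. The function names are hypothetical.

```python
import math


def tilted_policy(ref_probs, rewards, eta):
    """Return pi(a|x) proportional to pi_0(a|x) * exp(eta * r(x, a)).

    ref_probs: reference policy pi_0(.|x) over a finite action set.
    rewards:   r(x, a) for each action, in the same order.
    """
    m = max(rewards)  # subtract the max reward for numerical stability
    weights = [q * math.exp(eta * (r - m)) for q, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_regularized_value(policy, ref_probs, rewards, eta):
    """Assumed objective: E_pi[r] - (1 / eta) * KL(pi || pi_0)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - kl / eta


def log_partition(ref_probs, rewards, eta):
    """log Z_r(x) = log E_{a ~ pi_0}[exp(eta * r(x, a))]."""
    m = max(rewards)
    return eta * m + math.log(
        sum(q * math.exp(eta * (r - m)) for q, r in zip(ref_probs, rewards)))
```

Under the assumed objective, the tilted policy attains the value $\eta^{-1}\log Z_r(x)$, so comparing `kl_regularized_value(tilted_policy(...), ...)` with `log_partition(...) / eta` gives a quick numerical check. Setting `eta = 0` in `tilted_policy` returns the reference policy unchanged.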

Figures (8)

Figure 1: A typical large language model adaptation pipeline. We focus on privacy during the preference fine-tuning stage, where sensitive user interactions can be directly reflected in training records.
Figure 2: A sensitive interaction record. Even without direct identifiers, prompts can contain quasi-identifiers whose combination may re-identify an individual, motivating tuple-level protection of $(x_i,a_i^1,a_i^2,y_i)$.
Figure 3: Phase diagram of the statistical and privacy errors on a log-log scale. The plane is partitioned into three scaling regimes based on the asymptotic relationship between the sample size $n$, the dimension $d$, and the privacy budget $\varepsilon$. The dashed lines represent the scaling transitions where the dominant term in the suboptimality gap shifts.
Figure 4: Convergence of Suboptimality Gap. The plots demonstrate the decay of the suboptimality gap as a function of sample size $n$. (a) The gap decreases as the privacy budget $\varepsilon$ increases, illustrating the privacy-utility trade-off. (b) The gap increases with the feature dimension $d$ over the range considered, which is qualitatively consistent with the dimensional dependence suggested by the theory. Shaded regions indicate 95% confidence intervals over 30 trials.
Figure 5: Synthetic $\eta$-sweep at $(\varepsilon,\delta)=(1,10^{-5})$ (fixed $d=7$). Top row: suboptimality gap $V_\eta(\pi_\eta^\star)-V_\eta(\hat{\pi})$. Bottom row: normalized gap. Baselines use $C=2L(d)$. Shaded regions indicate 95% confidence intervals over 30 trials.
...and 3 more figures

Theorems & Definitions (23)

Definition 1: Bradley-Terry Model
Lemma 2: Policy Improvement Oracle
Definition 3: $(\varepsilon,\delta)$-Differential Privacy (Dwork et al., 2006)
Example 4: DPO
Example 5: PPO-style policy optimization
Proposition 6: Privacy of the Framework
proof
Lemma 7: Utility of private projected SGD
Remark 8: Unconditional DP-SGD control
Theorem 9: Upper bound on the suboptimality gap
...and 13 more
