Privately Aligning Language Models with Reinforcement Learning
Fan Wu, Huseyin A. Inan, Arturs Backurs, Varun Chandrasekaran, Janardhan Kulkarni, Robert Sim
TL;DR
This work addresses privacy concerns in aligning large language models with human feedback by designing a differential privacy framework for reinforcement learning (DP-RL). It introduces a three-stage pipeline (DP supervised fine-tuning, DP reward modeling, and DP policy optimization, called DP-PPO or DPPPO) and proves that the final policy $\pi$ satisfies $(\epsilon,\delta)$-DP with respect to the private data. The approach uses LoRA for parameter-efficient adaptation and trains with DPSGD, combined with subsampling and privacy accounting, to enable private RL on sentiment-generation and summarization tasks. The results indicate that privately aligning instruction-following LLMs is feasible with competitive utility at moderate privacy budgets, and that larger pre-trained models can improve the privacy-utility trade-off, suggesting DP-aligned RL as a viable path toward privacy-preserving, human-aligned LLMs.
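To make the DPSGD building block mentioned above concrete, the following is a minimal sketch of a single update step: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. This is a generic illustration and not the authors' implementation; the function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are assumptions chosen for the example, and privacy accounting across steps is omitted.

```python
import numpy as np


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-SGD update (hypothetical helper, not from the paper).

    params:            array of shape (d,)
    per_example_grads: array of shape (batch, d), one gradient per example
    """
    # L2 norm of each example's gradient, shape (batch,)
    norms = np.linalg.norm(per_example_grads, axis=1)
    # Rescale so that no single example contributes more than clip_norm
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]
    # Gaussian noise with std proportional to the clipping norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    # Noisy average gradient over the batch
    noisy_mean = (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]
    return params - lr * noisy_mean


# Usage with made-up numbers
rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
new_params = dp_sgd_step(np.zeros(2), grads, lr=1.0, clip_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
```

In practice the noise multiplier would be chosen by a privacy accountant so that the whole training run meets a target $(\epsilon,\delta)$; the sketch assumes that value is given.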
Abstract
Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction-following models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without a human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.
