PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

Sudip Bhujel

TL;DR

PrivMedChat is an end-to-end framework for differentially private RLHF (DP-RLHF) in medical dialogue. It introduces an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations, producing scalable preference data without clinician labeling.

Abstract

Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
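
The DP-SGD update that the abstract refers to can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dp_sgd_step`, its parameters, and the flat-list representation of gradients are assumptions made for illustration, and privacy accounting (tracking $\varepsilon$ and $\delta$ across steps) is omitted. It shows only the two core operations of DP-SGD: clipping each example's gradient to a fixed L2 norm and adding Gaussian noise to the clipped sum.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One illustrative DP-SGD update on a flat list of parameters.

    per_example_grads holds one gradient (a list of floats) per training example.
    All names here are hypothetical and chosen for this sketch.
    """
    batch_size = len(per_example_grads)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        # Clip each example's gradient so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Add Gaussian noise with standard deviation noise_multiplier * clip_norm
    # to the clipped sum, then average over the batch.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    # Plain gradient-descent update using the noisy averaged gradient.
    return [p - lr * g for p, g in zip(params, noisy_mean)]


# Usage: noise is disabled here only to make the clipping effect visible.
rng = random.Random(0)
new_params = dp_sgd_step(
    params=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.0, 0.0]],
    clip_norm=1.0,
    noise_multiplier=0.0,
    lr=1.0,
    rng=rng,
)
# The first gradient (norm 5) is scaled down to norm 1, so the result is
# approximately [-0.3, -0.4].
```

In practice, a setting with `noise_multiplier=0.0` provides no privacy guarantee; a real training run would use a positive noise multiplier chosen by a privacy accountant to meet a target budget such as the $\varepsilon=7$ reported above.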

Paper Structure (45 sections, 8 equations, 2 figures, 7 tables, 1 algorithm)

Introduction
Methods
Dataset and Preprocessing
Preference Pair Construction
Supervised Fine-Tuning
Differentially Private RLHF
DP-Reward Modeling
Policy Optimization with Firewall
Evaluation and Deployment
Evaluation
Deployment Considerations
Implementation
Experimental Setup
Evaluation Metrics
ROUGE-L
...and 30 more sections

Figures (2)

Figure 1: A standard medical LLM fine-tuned without privacy safeguards may reveal whether patients with rare symptoms were in its training set. In contrast, PrivMedChat's use of differential privacy prevents high-confidence membership inferences.
Figure 2: Data preprocessing and system overview of PrivMedChat. (a) Annotation-free construction of medical preference pairs. (b) The three-zone design separating DP-protected training (Zone 1) from evaluation and deployment (Zones 2-3).

Theorems & Definitions (1)

Definition 1: $(\varepsilon, \delta)$-Differential Privacy (Dwork and Roth, 2014)
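
For reference, the standard statement of this definition is given below. The symbols $\mathcal{M}$, $D$, $D'$, and $S$ are conventional notation and not taken from this excerpt, which does not show how the paper defines adjacency (for example, per dialogue or per patient). A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if, for all adjacent datasets $D$ and $D'$ differing in one record and for every measurable set of outputs $S$,

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
$$

Smaller $\varepsilon$ and $\delta$ mean that the output distribution changes less when any single record is added or removed, which bounds how confidently an adversary can infer membership.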
