Preference Robustness for DPO with Applications to Public Health

Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe

TL;DR

This work tackles learning reward functions for sequential public-health decision tasks under noisy human preferences. It introduces DPO-PRO, a lightweight distributionally robust extension of Direct Preference Optimization that hedges against uncertainty in the preference distribution using a chi-squared divergence, without being overly conservative. The approach regularizes DPO by a term that depends on the soft preference score $q$, the robustness radius $\rho$, and the model's confidence, yielding calibrated and robust learning with low inference-time overhead. Empirically, DPO-PRO improves robustness on standard alignment benchmarks and on a real-world ARMMAN maternal-health restless multi-armed bandit (RMAB) task while remaining more scalable than self-reflection-based baselines, supporting reliable deployment in sensitive public-health contexts.

Abstract

We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO), which accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves performance comparable to a prior self-reflection-based baseline for reward function design, while requiring significantly lower inference-time cost.
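
To make the idea of hedging the preference distribution concrete, here is a minimal Python sketch. It is not the authors' implementation, and the paper's exact objective may differ. It assumes that the ambiguity set is a chi-squared ball of radius `rho` around a Bernoulli preference with soft score `q`, and that `margin` is the usual scaled DPO log-ratio difference between the two responses. All function names are hypothetical. For a Bernoulli variable the chi-squared constraint reduces to $|p - q| \le \sqrt{\rho\, q(1-q)}$, and because the expected loss is linear in $p$, the worst case sits at an endpoint of that interval.

```python
import math


def dpo_logistic_loss(margin: float) -> float:
    """Numerically stable -log(sigmoid(margin))."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def worst_case_preference(q: float, rho: float,
                          loss_pref: float, loss_rev: float) -> float:
    """Adversarial preference probability inside the chi-squared ball.

    Assumption: chi2(Bern(p) || Bern(q)) = (p - q)^2 / (q (1 - q)) <= rho.
    """
    radius = math.sqrt(rho * q * (1.0 - q))
    # Shift probability mass toward whichever label has the larger loss.
    direction = 1.0 if loss_pref >= loss_rev else -1.0
    return min(1.0, max(0.0, q + direction * radius))


def robust_soft_dpo_loss(margin: float, q: float, rho: float) -> float:
    """Worst-case expected DPO loss over the assumed ambiguity set."""
    loss_pref = dpo_logistic_loss(margin)   # loss if response 1 is preferred
    loss_rev = dpo_logistic_loss(-margin)   # loss if response 2 is preferred
    p = worst_case_preference(q, rho, loss_pref, loss_rev)
    return p * loss_pref + (1.0 - p) * loss_rev


if __name__ == "__main__":
    # rho = 0 recovers the ordinary soft-label DPO loss.
    print(robust_soft_dpo_loss(margin=1.0, q=0.5, rho=0.0))
    print(robust_soft_dpo_loss(margin=1.0, q=0.5, rho=0.16))
```

Under these assumptions, and whenever the clipping to $[0, 1]$ is inactive, the robust loss equals the soft-label loss plus $\sqrt{\rho\, q(1-q)}\,|\text{margin}|$. The extra term vanishes when the label is certain ($q \in \{0, 1\}$) or the model is indifferent (margin of zero), which is one way to read the statement that the regularizer depends on $q$, $\rho$, and the model's confidence.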

Paper Structure

This paper contains 31 sections, 2 theorems, 12 equations, 1 figure, 3 tables.

Key Result

Proposition 4.1

Eq. (eq:grad_est) provides an unbiased gradient estimate of the DRO loss in Eq. (eq:dro_loss).

Figures (1)

  • Figure 1: Inference-time comparison of DPO-PRO and DLM across different population sizes (i.e., number of arms) in the underlying RMAB.

Theorems & Definitions (2)

  • Proposition 4.1
  • Proposition 4.2