Strategyproof Reinforcement Learning from Human Feedback

Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska

TL;DR

This work analyzes reinforcement learning from human feedback (RLHF) in pluralistic settings where labelers may strategically misreport to sway the learned policy. It proves that existing RLHF approaches are not strategyproof and that even a single strategic labeler can cause substantial misalignment with social welfare; in the worst case, any strategyproof RLHF method must be a factor of $k$ worse than the optimum, where $k$ is the number of labelers. To address this, the authors introduce the Pessimistic Median of MLEs (Pessimistic MoMLEs), which achieves approximate strategyproofness with a bound that scales as $\tilde{O}(\kappa_i \sqrt{d/n})$ and guarantees welfare convergence as the number of labelers $k$ and the number of samples $n$ increase, under a uniform policy-coverage assumption. The theory covers both contextual bandits and Markov decision processes (MDPs), providing suboptimality (SubOpt) and approximation-ratio bounds that quantify the trade-off between incentive alignment and policy alignment. The work also outlines extensions to non-linear rewards, different policy classes, and empirical validation, highlighting the practical relevance for robust, pluralistic AI alignment in real-world RLHF deployments.

Abstract

We study Reinforcement Learning from Human Feedback (RLHF) in settings where multiple labelers may strategically misreport feedback to steer the learned policy toward their own preferences. We show that existing RLHF algorithms, including recent pluralistic methods, are not strategyproof, and that even a single strategic labeler can cause arbitrarily large misalignment with social welfare. Moreover, we prove that, in the worst case, any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, where $k$ is the number of labelers. This suggests a fundamental trade-off between incentive alignment (ensuring labelers report truthfully) and policy alignment (maximizing social welfare). To address this, we propose the Pessimistic Median of MLEs algorithm, which, under appropriate policy coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of labelers and samples increases. Our results apply to both contextual bandits and Markov decision processes.
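To make the median-of-MLEs idea concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes linear rewards, a Bradley-Terry preference model, and a coordinate-wise median, and it omits the pessimism step entirely. For simplicity all honest labelers share one hypothetical parameter `theta_true`, whereas the paper's setting is pluralistic. All function names and constants are illustrative.

```python
import numpy as np

def fit_bt_mle(X, y, steps=500, lr=0.5):
    """Gradient-ascent MLE for a Bradley-Terry model with linear rewards.

    X[j] is the feature difference phi(a_j) - phi(b_j) of comparison j,
    and y[j] is 1 if the labeler reported a_j preferred over b_j, else 0.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))    # predicted P(a preferred)
        theta += lr * (X.T @ (y - p)) / len(y)  # average log-likelihood gradient
    return theta

def aggregate(datasets):
    """Return (coordinate-wise median, mean) of the per-labeler MLEs.

    The coordinate-wise median is an assumed aggregation rule for this sketch.
    """
    thetas = np.stack([fit_bt_mle(X, y) for X, y in datasets])
    return np.median(thetas, axis=0), thetas.mean(axis=0)

def demo(seed=0, k=5, n=400, d=2):
    """Compare median and mean aggregation when labeler 0 misreports."""
    rng = np.random.default_rng(seed)
    theta_true = np.array([1.0, -1.0])  # hypothetical honest reward parameter
    datasets = []
    for i in range(k):
        X = rng.normal(size=(n, d)) - rng.normal(size=(n, d))
        if i == 0:
            # Strategic labeler: always reports the opposite preference.
            y = (X @ theta_true < 0).astype(float)
        else:
            # Honest labelers: labels sampled from the Bradley-Terry model.
            p = 1.0 / (1.0 + np.exp(-X @ theta_true))
            y = (rng.random(n) < p).astype(float)
        datasets.append((X, y))
    med, mean = aggregate(datasets)
    return np.linalg.norm(med - theta_true), np.linalg.norm(mean - theta_true)

med_err, mean_err = demo()
print(f"median error: {med_err:.3f}, mean error: {mean_err:.3f}")
```

In this toy run the single strategic labeler drags the mean of the estimates far from the honest parameter, while the median stays close to it, which is the intuition behind bounding any one labeler's influence on the learned policy.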

Paper Structure

This paper contains 42 sections, 26 theorems, 84 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Proposition 3.2

Existing RLHF methods such as Pessimistic Social Welfare [zhong2024provable] and MaxMin-RLHF [chakrabortymaxmin] are not strategyproof.
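As a toy illustration of what a failure of strategyproofness looks like, the sketch below uses a rule that picks the action with the largest total reported reward. This is neither the paper's construction nor either of the cited algorithms: labelers here report rewards in $[0, 1]$ directly instead of preference data, and the reward values are hypothetical.

```python
def choose_action(reports):
    """Pick the action with the largest total reported reward (ties go to the first)."""
    totals = [sum(r[a] for r in reports) for a in range(len(reports[0]))]
    return max(range(len(totals)), key=totals.__getitem__)

# Hypothetical true rewards of three labelers over two actions (A, B).
true_rewards = [(0.7, 0.3), (0.6, 0.4), (0.4, 0.6)]

# Everyone reports truthfully.
truthful_action = choose_action(true_rewards)

# Labeler 2 exaggerates in favor of action B; the others stay truthful.
manipulated_action = choose_action(true_rewards[:2] + [(0.0, 1.0)])

# Labeler 2's change in TRUE utility from misreporting.
gain = true_rewards[2][manipulated_action] - true_rewards[2][truthful_action]

# Change in true social welfare caused by the misreport.
welfare_loss = (sum(r[truthful_action] for r in true_rewards)
                - sum(r[manipulated_action] for r in true_rewards))

print(truthful_action, manipulated_action, gain > 0, welfare_loss > 0)
```

Labeler 2 strictly benefits from exaggerating, so truthful reporting is not a dominant strategy under this rule, and the other labelers bear the welfare cost.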

Figures (1)

  • Figure: Pessimistic Median of MLEs (Pessimistic MoMLEs)

Theorems & Definitions (49)

  • Definition 3.1: Strategyproofness
  • Definition 3.2: $\varepsilon$-Strategyproofness
  • Proposition 3.2
  • Proposition 3.2
  • Proof sketch
  • Theorem 3.3
  • Proof sketch
  • Corollary 3.3
  • Theorem 4.1
  • Proof sketch
  • ...and 39 more