RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning

Yujie Zhao, Jose Efraim Aguilar Escamill, Weyl Lu, Huazheng Wang

TL;DR

This paper introduces Risk-Aware PbRL (RA-PbRL), an algorithm designed to optimize both nested and static risk-aware objectives. It provides a theoretical analysis of the regret upper bounds, showing that they are sublinear in the number of episodes, and presents empirical results that support the findings.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has recently surged in popularity, particularly for aligning large language models and other AI systems with human intentions. At its core, RLHF can be viewed as a specialized instance of Preference-based Reinforcement Learning (PbRL), where the preferences specifically originate from human judgments rather than arbitrary evaluators. Despite this connection, most existing approaches in both RLHF and PbRL primarily focus on optimizing a mean reward objective, neglecting scenarios that necessitate risk-awareness, such as AI safety, healthcare, and autonomous driving. These scenarios often operate under a one-episode-reward setting, which makes conventional risk-sensitive objectives inapplicable. To address this, we explore and prove the applicability of two risk-aware objectives to PbRL: nested and static quantile risk objectives. We also introduce Risk-Aware PbRL (RA-PbRL), an algorithm designed to optimize both nested and static objectives. Additionally, we provide a theoretical analysis of the regret upper bounds, demonstrating that they are sublinear with respect to the number of episodes, and present empirical results to support our findings. Our code is available at https://github.com/aguilarjose11/PbRLNeurips.
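To make the contrast between a mean objective and a risk-aware objective concrete, the following is a minimal sketch of an empirical lower-tail CVaR (the risk measure named in the figure captions) over a batch of episode returns. It is an illustration only, not the RA-PbRL algorithm: the function name `empirical_cvar`, the sample returns, and the choice of averaging the worst `ceil(alpha * n)` samples are all assumptions made for this example.

```python
import math

def empirical_cvar(returns, alpha):
    """Average of the worst alpha-fraction of returns (lower tail).

    Assumed estimator for illustration: sort the samples and average
    the ceil(alpha * n) smallest ones.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

# Hypothetical episode returns: two bad outcomes, two good ones.
episode_returns = [0.0, 1.0, 9.0, 10.0]

mean_value = sum(episode_returns) / len(episode_returns)
cvar_half = empirical_cvar(episode_returns, alpha=0.5)  # worst half only
cvar_full = empirical_cvar(episode_returns, alpha=1.0)  # reduces to the mean
```

With `alpha = 1.0` the estimate coincides with the mean, while smaller `alpha` values weight only the worst outcomes, which is the behavior a risk-aware objective is meant to capture.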

Paper Structure

This paper contains 37 sections, 37 theorems, 104 equations, 6 figures, 1 algorithm.

Key Result

Lemma 3.6

For a given tabular MDP, if the reward on the entire trajectory can be decomposed as $r_{\xi}^{\star}\left(\xi_{H}\right)=\sum_{h=1}^H r_h^{\star}\left(s_h,a_h\right)$, then $V^\pi_{1}$ in Eq. bellman and $\tilde{V}^\pi_1$ in Eq. bellman-pbrl are equivalent.
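The decomposability condition in the lemma can be illustrated with a small sketch: the trajectory-level reward is the sum of per-step rewards $r_h(s_h, a_h)$. The reward table, the horizon of 2, and the helper name `trajectory_reward` below are hypothetical and chosen only for illustration; the sketch does not reproduce the value functions the lemma compares.

```python
def trajectory_reward(step_rewards, trajectory):
    """Sum the per-step rewards r_h(s_h, a_h) along a trajectory.

    step_rewards[h] maps a (state, action) pair to the reward at step h;
    trajectory is the list of (state, action) pairs visited, one per step.
    """
    return sum(step_rewards[h][pair] for h, pair in enumerate(trajectory))

# Hypothetical horizon-2 tabular MDP with states {0, 1} and actions {"a", "b"}.
step_rewards = [
    {(0, "a"): 1.0, (0, "b"): 0.0},   # rewards at step h = 1
    {(0, "a"): 0.5, (1, "a"): 2.0},   # rewards at step h = 2
]
trajectory = [(0, "a"), (1, "a")]

total = trajectory_reward(step_rewards, trajectory)  # 1.0 + 2.0
```

When such a decomposition holds, a reward defined only on whole trajectories carries the same information as the per-step rewards summed along the path, which is the situation the lemma addresses.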

Figures (6)

  • Figure 1: Cumulative regret for static CVaR over different $\alpha$.
  • Figure 2: Cumulative regret for nested CVaR over different $\alpha$.
  • Figure 3: Cumulative regret for static CVaR in the MuJoCo setting over different $\alpha$.
  • Figure 4: Cumulative regret for the different $\alpha$.
  • Figure 5: Hard to learn case 1.
  • ...and 1 more figure

Theorems & Definitions (60)

  • Remark 3.2
  • Remark 3.3
  • Remark 3.4
  • Lemma 3.6
  • Lemma 3.7
  • Theorem 4.1: Nested objective regret upper bound
  • Theorem 4.2: Static objective regret upper bound
  • Theorem 4.3: Nested objective regret lower bound
  • Theorem 4.4: Static objective regret lower bound
  • Theorem B.1
  • ...and 50 more