Bayesian Risk-Sensitive Policy Optimization For MDPs With General Loss Functions

Xiaoshuang Wang, Yifan Lin, Enlu Zhou

TL;DR

This work addresses risk-sensitive planning for MDPs with general convex loss under epistemic uncertainty by introducing a Bayesian risk formulation that integrates a Bayesian posterior $\mu_N$ with a coherent risk measure $\rho$. It develops a policy-gradient algorithm, BR-PG, that leverages the dual representation of risk measures and an extended envelope theorem to derive a gradient in continuous parameter spaces, with a gradient estimator based on a variational approach or zeroth-order methods. The authors establish a finite-time convergence rate of $\mathcal{O}(T^{-1/2}+r^{-1/2})$ and show global convergence in an episodic setting as data size grows, along with bounds on iterations to achieve an $\mathcal{O}(\epsilon)$-error per episode. Numerical experiments on Frozen Lake demonstrate BR-PG’s robustness across loss functions (e.g., linear, CVaR-derived) and better performance under limited data compared to empirical and distributionally robust baselines, highlighting practical impact for offline planning with general loss objectives. The framework advances convex RL by enabling policy-gradient methods to handle continuous risk envelopes and unknown environment parameters in a unified offline setting.

Abstract

Motivated by many application problems, we consider Markov decision processes (MDPs) with a general loss function and unknown parameters. To mitigate the epistemic uncertainty associated with unknown parameters, we take a Bayesian approach to estimate the parameters from data and impose a coherent risk functional (with respect to the Bayesian posterior distribution) on the loss. Since this formulation usually does not satisfy the interchangeability principle, it does not admit Bellman equations and cannot be solved by approaches based on dynamic programming. Therefore, we propose a policy gradient optimization method, leveraging the dual representation of coherent risk measures and extending the envelope theorem to continuous cases. We then present a stationary analysis of the algorithm with a convergence rate of $\mathcal{O}(T^{-1/2}+r^{-1/2})$, where $T$ is the number of policy gradient iterations and $r$ is the sample size of the gradient estimator. We further extend our algorithm to an episodic setting, establish the global convergence of the extended algorithm, and provide bounds on the number of iterations needed to achieve an error bound of $\mathcal{O}(\epsilon)$ in each episode.
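
To illustrate how a dual representation combined with an envelope-type argument can yield a gradient, here is a minimal Python sketch. It is not the BR-PG algorithm. It assumes a toy quadratic loss, a fixed set of equally weighted posterior samples (`omegas`), and CVaR as the coherent risk measure; these choices and all function names are illustrative assumptions, not taken from the paper. The idea shown is that, once the worst-case weights in the risk envelope are found for the current parameter, the gradient of the risk is the weighted average of the per-sample loss gradients.

```python
import numpy as np

def worst_case_weights(losses, alpha):
    """Maximizer of the CVaR dual problem for n equally weighted samples.

    Returns weights w_i >= 0 summing to 1, each at most 1 / ((1 - alpha) * n),
    placed greedily on the largest losses.
    """
    n = len(losses)
    cap = 1.0 / ((1.0 - alpha) * n)
    w = np.zeros(n)
    remaining = 1.0
    for i in np.argsort(-losses):          # largest loss first
        w[i] = min(cap, remaining)
        remaining -= w[i]
        if remaining <= 1e-12:
            break
    return w

def loss_and_grad(theta, omegas):
    """Toy loss 0.5 * (theta - omega)^2 and its derivative in theta (assumption)."""
    diff = theta - omegas
    return 0.5 * diff**2, diff

def cvar_objective(theta, omegas, alpha):
    losses, _ = loss_and_grad(theta, omegas)
    return float(worst_case_weights(losses, alpha) @ losses)

def envelope_gradient(theta, omegas, alpha):
    """Gradient of the risk: per-sample gradients averaged under the worst-case weights."""
    losses, grads = loss_and_grad(theta, omegas)
    return float(worst_case_weights(losses, alpha) @ grads)

# Hypothetical posterior samples of an unknown scalar parameter.
rng = np.random.default_rng(0)
omegas = rng.normal(loc=1.0, scale=0.5, size=200)
alpha = 0.9

theta0 = 3.0
theta = theta0
for _ in range(200):                        # plain gradient descent on the risk
    theta -= 0.1 * envelope_gradient(theta, omegas, alpha)

print(cvar_objective(theta0, omegas, alpha), cvar_objective(theta, omegas, alpha))
```

Because the objective is a maximum over weight vectors, it can be non-smooth where the set of worst-case samples changes; the formula above gives the gradient wherever the maximizer is unique and a subgradient otherwise.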

Paper Structure

This paper contains 31 sections, 14 theorems, 87 equations, 5 figures, 3 tables, 4 algorithms.

Key Result

Theorem 1

(Theorem 6.6 in shapiro2021lectures.) A risk measure $\rho: \mathcal{Z} \rightarrow \mathbb{R}$ is coherent if and only if there exists a convex bounded and closed set (also known as risk envelope) $\mathcal{U}=\mathcal{U}(\mu_N) \subset \mathcal{B}$ such that $\rho(Z)=\max _{\xi: \xi \in \mathcal{
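
As a concrete instance of a risk-envelope representation, consider CVaR at level $\alpha$, a standard coherent risk measure. Its envelope is commonly described as the set of densities $\xi$ with $0 \le \xi \le 1/(1-\alpha)$ and $\mathbb{E}[\xi]=1$. The sketch below uses a hypothetical four-point discrete distribution (standing in for a posterior; the numbers and function names are assumptions for illustration) and checks numerically that maximizing the reweighted expectation over this envelope agrees with the Rockafellar-Uryasev minimization formula for CVaR.

```python
import numpy as np

def cvar_dual(z, p, alpha):
    """max over xi of sum_i p_i * xi_i * z_i
    subject to 0 <= xi_i <= 1/(1 - alpha) and sum_i p_i * xi_i = 1.
    Solved greedily: push as much mass as allowed onto the largest losses.
    Assumes all p_i > 0.
    """
    z = np.asarray(z, dtype=float)
    p = np.asarray(p, dtype=float)
    cap = 1.0 / (1.0 - alpha)
    xi = np.zeros_like(z)
    remaining = 1.0
    for i in np.argsort(-z):
        mass = min(cap * p[i], remaining)
        xi[i] = mass / p[i]
        remaining -= mass
        if remaining <= 1e-12:
            break
    return float(np.sum(p * xi * z)), xi

def cvar_primal(z, p, alpha):
    """min over t of t + E[(Z - t)_+] / (1 - alpha); for a discrete Z the
    minimum is attained at one of the support points."""
    z = np.asarray(z, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(min(t + np.sum(p * np.maximum(z - t, 0.0)) / (1.0 - alpha) for t in z))

z = [1.0, 2.0, 3.0, 4.0]
p = [0.25, 0.25, 0.25, 0.25]
dual_value, xi_star = cvar_dual(z, p, alpha=0.5)
primal_value = cvar_primal(z, p, alpha=0.5)
print(dual_value, primal_value)  # both equal 3.5, the mean of the two largest losses
```

The maximizing density `xi_star` puts weight only on the worst outcomes, which is the sense in which the envelope encodes risk aversion.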

Figures (5)

  • Figure 1: Results for episodic case with different episode numbers and iterations per episode under the same escape probability $\theta_e=0.02$ and $50$ replications. Here the loss function is still chosen to be the linear loss. $95\%$ confidence intervals are reported by the shaded bands.
  • Figure 2: Results for loss function "KL Divergence" with data sizes $N=5$ and $50$ under $\theta_e=0.02$. $95\%$ confidence intervals are reported in the shaded area.
  • Figure 3: Map of frozen lake problem
  • Figure 4: Result for utility function "mean" with data size $N=5$ and escape probability $\theta_e=0.02$
  • Figure 5: Results for utility function "KL divergence" with data size $N=5$ and escape probability $\theta_e=0.2$ and $\theta_e=0.8$

Theorems & Definitions (29)

  • Theorem 1
  • Definition 3.1
  • Theorem 2
  • Lemma 3.1
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Corollary 1
  • Definition F.1
  • ...and 19 more