Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization

Uri Gadot; Esther Derman; Navdeep Kumar; Maxence Mohamed Elfatihi; Kfir Levy; Shie Mannor

TL;DR

The paper studies reward-robust MDPs with coupled (non-rectangular) reward uncertainty, modeled as $L_p$-norm balls around a nominal reward, with the transition kernel held fixed. It derives the worst-case reward in closed form and shows that the robust return reduces to a regularized objective whose penalty depends on the policy's visitation frequency. This links reward robustness directly to frequency regularization and yields a convergent policy-gradient method. The authors prove convergence of the proposed robust policy gradient, provide an online actor-critic algorithm, and show experimentally that coupling rewards yields robustness with less conservatism than traditional rectangular uncertainty. By removing the rectangularity constraint, the work broadens robust RL and offers a scalable, interpretable approach for high-dimensional settings where reward misspecification is a concern.

Abstract

In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an $\alpha$-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
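
The link between an $\alpha$-radius reward ball and a visitation-frequency penalty can be checked numerically. The sketch below is not the paper's code. It assumes the return is written as the inner product $\langle d_\pi, r\rangle$ of a nonnegative visitation-frequency vector with the reward, that the frequencies are normalized to sum to one, and that $1 < p < \infty$. Under those assumptions, Hölder's inequality gives the worst-case return $\langle d_\pi, r_0\rangle - \alpha\lVert d_\pi\rVert_q$ with $1/p + 1/q = 1$. The function names and the toy data are hypothetical.

```python
import numpy as np

def worst_case_reward(d, r0, alpha, p):
    """Minimizer of <d, r> over the ball ||r - r0||_p <= alpha.

    Assumes d >= 0 elementwise and 1 < p < inf (illustrative sketch only).
    """
    q = p / (p - 1.0)                            # Hoelder conjugate of p
    d_q = np.linalg.norm(d, ord=q)
    direction = d ** (q - 1) / d_q ** (q - 1)    # unit L_p norm, aligned with d
    return r0 - alpha * direction

def robust_return(d, r0, alpha, p):
    """Nominal return minus a frequency penalty alpha * ||d||_q."""
    q = p / (p - 1.0)
    return d @ r0 - alpha * np.linalg.norm(d, ord=q)

rng = np.random.default_rng(0)
n = 6                                  # hypothetical number of state-action pairs
d = rng.random(n)
d /= d.sum()                           # assumed normalization of the frequencies
r0 = rng.normal(size=n)                # hypothetical nominal reward
alpha, p = 0.5, 3.0

r_star = worst_case_reward(d, r0, alpha, p)
print("return under worst-case reward:", d @ r_star)
print("regularized form              :", robust_return(d, r0, alpha, p))
```

The two printed values should agree up to floating-point error, which is the sense in which an adversarial reward inside the ball acts as a regularizer on the visitation frequency.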

Paper Structure (28 sections, 30 theorems, 74 equations, 4 figures, 7 tables, 1 algorithm)


Complexity Analysis
Connection between regularization and reward uncertainty set rectangularity
Proofs from the section on solving reward-RMDPs: Analyzing Reward-Robust MDPs
Proof of Proposition 1
Proof of Lemma 2 (Stationary policies are enough)
Proof of Lemma 3 (Duality)
Proof of Theorem 5 (Worst-case reward)
Proof of Corollary 6 (Reward robust return)
Extension to Weighted Lp norms
Proof of Theorem 7
Proof of Corollary 8
Proofs from the section on policy improvement: Reward-Robust Policy Gradient
Proof of Theorem 9
Proof of Lemma 10 (Smoothness)
Proof of the robust policy gradient convergence theorem
...and 13 more sections

Key Result

Proposition 1

For a non-rectangular uncertainty set $\mathcal{R}$, the robust Bellman operator $\mathcal{T}^\pi_\mathcal{R}$ (resp., $\mathcal{T}^*_\mathcal{R}$) has $v^\pi_{C(\mathcal{R})}$ (resp., $v^*_{C(\mathcal{R})}$) as its fixed point, where $C(\mathcal{R})$ is the smallest $s$-rectangular uncertainty set containing $\mathcal{R}$.

Figures (4)

Figure 1: An illustrative low-dimensional example of conservatism. Given an unknown coupled uncertainty set (in blue; see the appendix for details on this particular set), we consider two modeling approaches. One uses an s-rectangular uncertainty set with a constant radius $\alpha$ for each state independently (in green). The other uses a coupled uncertainty set (in red) with the same radius. Increasing $\alpha$ increases conservatism. The rectangular set covers the true uncertainty set sooner as $\alpha$ grows, but it also expands rapidly to a considerable size. The coupled set covers the true uncertainty set later, yet it is less conservative.
Figure 2: $\mathrm{CVaR}_{5\%}$ results for different $\alpha$
Figure 3: Evaluation results on both environments for different reward perturbations.
Figure 4: $\mathrm{CVaR}_{5\%}$ results for different $\alpha$ and different state-space sizes $\lvert\mathcal{S}\rvert$
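
The conservatism gap described in the Figure 1 caption can be illustrated with a small computation. This is a sketch under stated assumptions, not the construction used for the figure. It assumes the return is $\langle d, r\rangle$ for a nonnegative visitation-frequency matrix $d$ of shape (states, actions), and that both uncertainty sets are norm balls of radius $\alpha$ around the nominal reward: one ball per state for the s-rectangular set, and a single ball over all state-action pairs for the coupled set. Applying Hölder's inequality state by state gives the penalty $\alpha\sum_s \lVert d(s,\cdot)\rVert_q$ for the former and $\alpha\lVert d\rVert_q$ for the latter. The function names and the random frequencies are hypothetical.

```python
import numpy as np

def coupled_penalty(d, alpha, q):
    """Worst-case return loss for one radius-alpha ball over all (s, a) pairs."""
    return alpha * np.linalg.norm(d.ravel(), ord=q)

def s_rectangular_penalty(d, alpha, q):
    """Worst-case return loss for an independent radius-alpha ball per state."""
    per_state_norms = np.linalg.norm(d, ord=q, axis=1)   # one norm per state row
    return alpha * per_state_norms.sum()

rng = np.random.default_rng(1)
num_states, num_actions = 5, 3         # hypothetical problem size
d = rng.random((num_states, num_actions))
d /= d.sum()                           # assumed normalization of the frequencies
alpha, q = 0.5, 2.0                    # q = 2 corresponds to p = 2

print("coupled penalty      :", coupled_penalty(d, alpha, q))
print("s-rectangular penalty:", s_rectangular_penalty(d, alpha, q))
```

For any $q \ge 1$ the per-state sum is at least as large as the single global norm, so at equal radius the s-rectangular model subtracts at least as much from the nominal return as the coupled one.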

Theorems & Definitions (51)

Proposition 1
Lemma 2: Stationary policies are enough
Lemma 3: Duality
Remark 4
Theorem 5: Worst-case reward
Corollary 6: Reward robust return
Theorem 7
Corollary 8
Theorem 9
Lemma 10: Smoothness
...and 41 more
