Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning

Zhishuai Liu, Weixin Wang, Pan Xu

TL;DR

This work tackles off-dynamics reinforcement learning, where the training and deployment environments differ, in the online setting under distributionally robust MDPs (DRMDPs) with linear function approximation. It proposes We-DRIVE-U, a variance-aware, rare-switching algorithm that uses variance-weighted ridge regression and a carefully constructed optimistic variance estimator to bound the average suboptimality by $\widetilde{O}(dH\min\{1/\rho, H\}/\sqrt{K})$. It also establishes an information-theoretic lower bound of $\Omega(dH^{1/2}\min\{1/\rho,H\}/\sqrt{K})$, so the algorithm is near-optimal up to a factor of $\sqrt{H}$ for every $\rho\in(0,1]$. The method keeps deployment costs low, needing only $\mathcal{O}(dH\log(1+H^2K))$ global policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ dual-optimization calls, whereas prior online DRMDP algorithms incur switching and oracle costs that scale linearly with $K$. A novel hard instance demonstrates the problem’s intrinsic difficulty, and experiments on simulated linear DRMDPs show robust performance and substantially reduced switching, in line with the theoretical guarantees. The results advance practical robust RL in high-dimensional settings by combining variance-aware estimation, efficient dual optimization, and distributional robustness under TV-based uncertainty.
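The two ingredients named above, variance-weighted ridge regression and rare switching, can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: the function names, the regularizer `lam`, and the determinant-ratio threshold `ratio` are assumptions made for illustration, and the determinant test is simply a rule commonly used in the rare-switching literature, not one confirmed by this summary.

```python
import numpy as np


def weighted_ridge(features, targets, variances, lam=1.0):
    """Variance-weighted ridge regression (illustrative sketch).

    Minimizes sum_i (phi_i^T w - y_i)^2 / sigma_i^2 + lam * ||w||^2,
    so samples with a larger variance estimate get a smaller weight.
    Returns the estimate w and the weighted covariance matrix Sigma.
    """
    phi = np.asarray(features, dtype=float)              # shape (n, d)
    y = np.asarray(targets, dtype=float)                 # shape (n,)
    inv_var = 1.0 / np.asarray(variances, dtype=float)   # shape (n,)
    d = phi.shape[1]
    # Sigma = lam * I + sum_i phi_i phi_i^T / sigma_i^2
    sigma = lam * np.eye(d) + (phi * inv_var[:, None]).T @ phi
    # w = Sigma^{-1} sum_i phi_i y_i / sigma_i^2
    w = np.linalg.solve(sigma, phi.T @ (inv_var * y))
    return w, sigma


def should_switch(sigma_new, sigma_old, ratio=2.0):
    """Assumed rare-switching test: update the policy only when
    det(sigma_new) exceeds ratio * det(sigma_old)."""
    # Compare log-determinants to avoid overflow in high dimension.
    _, logdet_new = np.linalg.slogdet(sigma_new)
    _, logdet_old = np.linalg.slogdet(sigma_old)
    return logdet_new > logdet_old + np.log(ratio)
```

Under a rule of this kind the policy is recomputed only when the accumulated information has grown by a constant factor, which is the usual mechanism behind switch counts that grow logarithmically, rather than linearly, in $K$.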

Abstract

We study off-dynamics Reinforcement Learning (RL), where the policy training and deployment environments are different. To deal with this environmental perturbation, we focus on learning policies robust to uncertainties in transition dynamics under the framework of distributionally robust Markov decision processes (DRMDPs), where the nominal and perturbed dynamics are linear Markov Decision Processes. We propose a novel algorithm We-DRIVE-U that enjoys an average suboptimality $\widetilde{\mathcal{O}}\big(d H \cdot \min \{1/\rho, H\}/\sqrt{K}\big)$, where $K$ is the number of episodes, $H$ is the horizon length, $d$ is the feature dimension and $\rho$ is the uncertainty level. This result improves the state-of-the-art by $\mathcal{O}(dH/\min\{1/\rho,H\})$. We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting, which indicates our algorithm is near-optimal up to $\mathcal{O}(\sqrt{H})$ for any uncertainty level $\rho\in(0,1]$. Our algorithm also enjoys a 'rare-switching' design, and thus only requires $\mathcal{O}(dH\log(1+H^2K))$ policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ calls to an oracle for solving dual optimization problems, which significantly improves the computational efficiency of existing algorithms for DRMDPs, whose policy switch and oracle complexities are both $\mathcal{O}(K)$.
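As a worked reading of these rates (a restatement that ignores constants and logarithmic factors, not a result taken from the paper), the upper bound splits into two regimes depending on whether $\rho$ is above or below $1/H$, and dividing it by the lower bound $\Omega(dH^{1/2}\min\{1/\rho,H\}/\sqrt{K})$ quoted in the TL;DR gives the $\sqrt{H}$ gap:

```latex
\[
\frac{dH\min\{1/\rho, H\}}{\sqrt{K}} =
\begin{cases}
\dfrac{dH}{\rho\sqrt{K}}, & \rho \ge 1/H,\\[2ex]
\dfrac{dH^{2}}{\sqrt{K}}, & \rho < 1/H,
\end{cases}
\qquad
\frac{dH\min\{1/\rho, H\}/\sqrt{K}}{dH^{1/2}\min\{1/\rho, H\}/\sqrt{K}} = \sqrt{H}.
\]
```

The $\min\{1/\rho, H\}$ factor cancels in the ratio, which is why the gap is $\sqrt{H}$ uniformly over $\rho\in(0,1]$.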

Paper Structure

This paper contains 41 sections, 28 theorems, 162 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Proposition 3.2

(Hardness result) There exist two $d$-rectangular linear DRMDPs $\{\mathcal{M}_0, \mathcal{M}_1\}$ such that $\inf_{\mathcal{A}\mathcal{L}\mathcal{G}}\sup_{\theta\in\{0,1\}}\mathbb{E}[\text{AveSubopt}^{\mathcal{M}_{\theta},\mathcal{A}\mathcal{L}\mathcal{G}}(K)] \geq \Omega(\rho\cdot H)$, where ...

Figures (3)

  • Figure 1: The source and the target linear MDP environments. The value on each arrow represents the transition probability. For the source MDP, there are five states and three steps, with the initial state being $x_1$, the fail state being $x_4$, and $x_5$ being an absorbing state with reward 1. The target MDP on the right is obtained by perturbing the transition probability at the first step of the source MDP, with others remaining the same.
  • Figure 2: Simulation results under different source domains. The $x$-axis represents the perturbation level corresponding to different target environments. $\rho_{1,4}$ is the input uncertainty level for our We-DRIVE-U algorithm. $\Vert\xi\Vert_1$ is the hyperparameter of the linear DRMDP environment.
  • Figure 3: Constructions of the nominal MDP and the worst-case MDP environments.

Theorems & Definitions (38)

  • Proposition 3.2
  • Proposition 3.4: Remark 4.2 of liu2024distributionally
  • Remark 4.1
  • Remark 4.2
  • Remark 4.3
  • Theorem 5.1
  • Theorem 5.2
  • Remark 5.3
  • Remark 5.4
  • Theorem 5.5
  • ...and 28 more