Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning
Zhishuai Liu, Weixin Wang, Pan Xu
TL;DR
This work tackles off-dynamics reinforcement learning, where the training and deployment environments differ, in the online setting under distributionally robust MDPs (DRMDPs) with linear function approximation. It proposes We-DRIVE-U, a variance-aware, rare-switching algorithm that combines variance-weighted ridge regression with a carefully constructed optimistic variance estimator to achieve an average suboptimality of $\widetilde{\mathcal{O}}(dH\min\{1/\rho, H\}/\sqrt{K})$. An information-theoretic lower bound of $\Omega(dH^{1/2}\min\{1/\rho,H\}/\sqrt{K})$, derived from a novel hard instance, shows that the algorithm is near-optimal up to a factor of $\sqrt{H}$ for every $\rho\in(0,1]$. The algorithm needs only $\mathcal{O}(dH\log(1+H^2K))$ global policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ dual-optimization oracle calls, whereas both complexities are $\mathcal{O}(K)$ for prior online DRMDP algorithms. Experiments on simulated linear DRMDPs show robust performance and substantially fewer policy switches, in line with the theoretical guarantees. The results advance practical robust RL with linear function approximation by combining variance-aware estimation, efficient dual optimization, and distributional robustness under TV-based uncertainty.
Abstract
We study off-dynamics Reinforcement Learning (RL), where the policy training and deployment environments are different. To deal with this environmental perturbation, we focus on learning policies robust to uncertainties in transition dynamics under the framework of distributionally robust Markov decision processes (DRMDPs), where the nominal and perturbed dynamics are linear Markov decision processes. We propose a novel algorithm, We-DRIVE-U, that enjoys an average suboptimality of $\widetilde{\mathcal{O}}\big(dH \cdot \min\{1/\rho, H\}/\sqrt{K}\big)$, where $K$ is the number of episodes, $H$ is the horizon length, $d$ is the feature dimension, and $\rho$ is the uncertainty level. This result improves the state of the art by a factor of $\mathcal{O}(dH/\min\{1/\rho,H\})$. We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting, which indicates that our algorithm is near-optimal up to a factor of $\mathcal{O}(\sqrt{H})$ for any uncertainty level $\rho\in(0,1]$. Our algorithm also has a 'rare-switching' design, and thus requires only $\mathcal{O}(dH\log(1+H^2K))$ policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ calls to an oracle that solves dual optimization problems. This significantly improves on the computational efficiency of existing algorithms for DRMDPs, whose policy-switch and oracle complexities are both $\mathcal{O}(K)$.
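To make the scalings stated above concrete, the following minimal sketch evaluates the upper bound, the lower bound, and the switch and oracle complexities for given $d$, $H$, $K$, and $\rho$. It is only an illustration: all constants and logarithmic factors hidden by the $\widetilde{\mathcal{O}}$, $\mathcal{O}$, and $\Omega$ notation are assumed to be 1, and the function names are hypothetical and not taken from the paper.

```python
import math


def upper_rate(d, H, K, rho):
    """Upper-bound scaling d * H * min(1/rho, H) / sqrt(K).

    Assumption: hidden constants and log factors are set to 1.
    """
    return d * H * min(1.0 / rho, H) / math.sqrt(K)


def lower_rate(d, H, K, rho):
    """Lower-bound scaling d * sqrt(H) * min(1/rho, H) / sqrt(K).

    Assumption: the hidden constant is set to 1.
    """
    return d * math.sqrt(H) * min(1.0 / rho, H) / math.sqrt(K)


def switch_rate(d, H, K):
    """Policy-switch scaling d * H * log(1 + H^2 * K), constant set to 1."""
    return d * H * math.log(1 + H**2 * K)


def oracle_rate(d, H, K):
    """Dual-oracle-call scaling d^2 * H * log(1 + H^2 * K), constant set to 1."""
    return d**2 * H * math.log(1 + H**2 * K)


if __name__ == "__main__":
    d, H, K, rho = 5, 10, 10_000, 0.5
    up, low = upper_rate(d, H, K, rho), lower_rate(d, H, K, rho)
    print("upper / lower ratio:", up / low)  # equals sqrt(H) by construction
    print("switch scaling vs. K:", switch_rate(d, H, K), K)
    print("oracle scaling vs. K:", oracle_rate(d, H, K), K)
```

Under these assumptions the ratio of the two rates is exactly $\sqrt{H}$, independent of $d$, $K$, and $\rho$, which is the gap described above. The switch and oracle scalings grow only logarithmically in $K$, so they fall below $K$ once $K$ is large relative to $dH$ and $d^2H$, respectively.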
