Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning

Zhishuai Liu; Weixin Wang; Pan Xu

TL;DR

This work studies off-dynamics reinforcement learning in distributionally robust MDPs (DRMDPs) with linear function approximation, in the online setting where the training and deployment environments differ. It proposes We-DRIVE-U, a variance-aware, rare-switching algorithm that combines variance-weighted ridge regression with a carefully constructed optimistic variance estimator to bound the average suboptimality by $\widetilde{O}(dH\min\{1/\rho, H\}/\sqrt{K})$. An information-theoretic lower bound of $\Omega(dH^{1/2}\min\{1/\rho,H\}/\sqrt{K})$ shows that the algorithm is near-optimal up to a factor of $\sqrt{H}$ for any $\rho\in(0,1]$. The method keeps deployment costs low: it needs only $\mathcal{O}(dH\log(1+H^2K))$ global policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ dual-optimization calls, whereas both costs scale with $K$ in prior online DRMDP algorithms. A novel hard instance demonstrates the problem's intrinsic difficulty, and experiments on simulated linear DRMDPs show robust performance and substantially reduced switching, in line with the theoretical guarantees. The results advance practical robust RL in high-dimensional settings by combining variance-aware estimation, efficient dual optimization, and distributional robustness under total-variation (TV) uncertainty.
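To make "variance-weighted ridge regression" concrete, the following is a minimal generic sketch in Python with NumPy. It shows only the textbook inverse-variance weighted ridge estimator; it is not the exact estimator used by We-DRIVE-U, and the function name `weighted_ridge`, the regularizer `lam`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def weighted_ridge(features, targets, variances, lam=1.0):
    """Solve min_theta sum_i (phi_i^T theta - y_i)^2 / sigma_i^2 + lam * ||theta||^2.

    Generic illustration only; all names here are hypothetical.
    """
    phi = np.asarray(features, dtype=float)        # shape (n, d)
    y = np.asarray(targets, dtype=float)           # shape (n,)
    w = 1.0 / np.asarray(variances, dtype=float)   # inverse-variance weights, shape (n,)
    d = phi.shape[1]
    # Weighted covariance matrix: lam * I + sum_i phi_i phi_i^T / sigma_i^2
    cov = lam * np.eye(d) + (phi * w[:, None]).T @ phi
    # Weighted moment vector: sum_i phi_i * y_i / sigma_i^2
    moment = phi.T @ (w * y)
    theta = np.linalg.solve(cov, moment)
    return theta, cov

# Synthetic heteroscedastic data (assumed, for demonstration only).
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
phi = rng.normal(size=(200, 3))
sigma2 = rng.uniform(0.1, 4.0, size=200)           # per-sample noise variances
y = phi @ true_theta + rng.normal(size=200) * np.sqrt(sigma2)
theta_hat, cov = weighted_ridge(phi, y, sigma2)
print(theta_hat)
```

Samples with a small variance estimate receive a large weight, so low-noise observations dominate the fit. With all variances equal to one, the estimator reduces to ordinary ridge regression.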

Abstract

We study off-dynamics Reinforcement Learning (RL), where the policy training and deployment environments are different. To deal with this environmental perturbation, we focus on learning policies robust to uncertainties in transition dynamics under the framework of distributionally robust Markov decision processes (DRMDPs), where the nominal and perturbed dynamics are linear Markov Decision Processes. We propose a novel algorithm We-DRIVE-U that enjoys an average suboptimality of $\widetilde{\mathcal{O}}\big(d H \cdot \min \{1/\rho, H\}/\sqrt{K}\big)$, where $K$ is the number of episodes, $H$ is the horizon length, $d$ is the feature dimension and $\rho$ is the uncertainty level. This result improves the state-of-the-art by a factor of $\mathcal{O}(dH/\min\{1/\rho,H\})$. We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting, which indicates that our algorithm is near-optimal up to a factor of $\mathcal{O}(\sqrt{H})$ for any uncertainty level $\rho\in(0,1]$. Our algorithm also enjoys a 'rare-switching' design, and thus only requires $\mathcal{O}(dH\log(1+H^2K))$ policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ calls to an oracle that solves dual optimization problems. This significantly improves on the computational efficiency of existing algorithms for DRMDPs, whose policy switch and oracle complexities are both $\mathcal{O}(K)$.
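The logarithmic number of policy switches can be illustrated with a determinant-doubling rule, a common rare-switching criterion in linear bandits and linear MDPs. The sketch below assumes that criterion and a single covariance matrix; this section does not state the paper's exact update rule, and the function name `count_switches`, the regularizer `lam`, and the random unit-norm features are hypothetical.

```python
import numpy as np

def count_switches(features, lam=1.0):
    """Count policy updates when the policy is refreshed only after the
    determinant of the feature covariance has at least doubled since the
    previous update (assumed criterion, for illustration)."""
    d = features.shape[1]
    cov = lam * np.eye(d)
    # Log-determinant of the covariance at the last policy update.
    _, last_logdet = np.linalg.slogdet(cov)
    switches = 0
    for phi in features:
        cov += np.outer(phi, phi)                  # rank-one covariance update
        _, logdet = np.linalg.slogdet(cov)
        if logdet >= last_logdet + np.log(2.0):    # determinant at least doubled
            switches += 1
            last_logdet = logdet
    return switches

rng = np.random.default_rng(0)
K, d = 5000, 4
feats = rng.normal(size=(K, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # unit-norm features
n_switches = count_switches(feats)
print(n_switches, "switches over", K, "episodes")
```

Because each switch at least doubles the determinant, and the determinant after $K$ unit-norm features is at most $(\lambda + K/d)^d$, the number of switches in this sketch is at most $d\log_2(1 + K/(d\lambda))$. This is logarithmic in $K$, compared with $K$ updates for an algorithm that refreshes its policy every episode.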

Paper Structure (41 sections, 28 theorems, 162 equations, 3 figures, 2 tables, 1 algorithm)


Introduction
Novelty in Algorithm and Hard Instance Design
Technical Challenges
Notations
Related Work
Distributionally Robust MDPs
Online Linear MDPs and Linear DRMDPs
Preliminary
Algorithm Design
Variance-Weighted Ridge Regression for Online DRMDPs
Variance Estimator with Refined Dependence on Problem Parameters
Algorithm Interpretation
Theoretical Analysis
Discussion on the Tightness of the Upper and Lower Bounds
Experiments on Simulated Linear DRMDPs
...and 26 more sections

Key Result

Proposition 3.2

(Hardness result) There exist two $d$-rectangular linear DRMDPs $\{\mathcal{M}_0, \mathcal{M}_1\}$, such that $\inf_{\mathcal{A}\mathcal{L}\mathcal{G}}\sup_{\theta\in\{0,1\}}\mathbb{E}[\text{AveSubopt}^{\mathcal{M}_{\theta},\mathcal{A}\mathcal{L}\mathcal{G}}(K)] \geq \Omega(\rho\cdot H)$, where $\t

Figures (3)

Figure 1: The source and the target linear MDP environments. The value on each arrow represents the transition probability. For the source MDP, there are five states and three steps, with the initial state being $x_1$, the fail state being $x_4$, and $x_5$ being an absorbing state with reward 1. The target MDP on the right is obtained by perturbing the transition probability at the first step of the source MDP, with others remaining the same.
Figure 2: Simulation results under different source domains. The $x$-axis represents the perturbation level corresponding to different target environments. $\rho_{1,4}$ is the input uncertainty level for our We-DRIVE-U algorithm. $\Vert\xi\Vert_1$ is the hyperparameter of the linear DRMDP environment.
Figure 3: Constructions of the nominal MDP and the worst-case MDP environments.

Theorems & Definitions (38)

Proposition 3.2
Proposition 3.4: Remark 4.2 of liu2024distributionally
Remark 4.1
Remark 4.2
Remark 4.3
Theorem 5.1
Theorem 5.2
Remark 5.3
Remark 5.4
Theorem 5.5
...and 28 more
