Distributionally Robust Model-based Reinforcement Learning with Large State Spaces

Shyam Sundhar Ramesh; Pier Giuseppe Sessa; Yifan Hu; Andreas Krause; Ilija Bogunovic

TL;DR

The paper tackles distributionally robust reinforcement learning in continuous, high-dimensional state spaces with potential sim-to-real gaps. It introduces a model-based approach that learns nominal transition dynamics via a multi-output Gaussian Process and a Maximum Variance Reduction strategy, then optimizes robust policies within KL, χ², or TV uncertainty sets. The authors establish novel finite-sample complexity bounds that scale with information-theoretic quantities rather than state-space size, and demonstrate strong empirical robustness and data efficiency on Pendulum, Cartpole, and Reacher benchmarks. This work advances robust RL for large-scale, non-linear dynamics by enabling near-optimal policies with limited simulator interactions and providing a principled framework to adapt to distributional shifts in real-world deployment.
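For intuition on what optimizing within a KL uncertainty set involves, the worst-case expected value over a KL ball around the nominal model has a standard one-dimensional dual form from the distributionally robust optimization literature. The symbols below are illustrative assumptions rather than the paper's own notation: $P^{o}$ is the nominal next-state distribution, $\rho$ the radius of the uncertainty set, and $V$ a bounded value function.

```latex
\inf_{P \,:\, D_{\mathrm{KL}}(P \,\|\, P^{o}) \le \rho}
  \mathbb{E}_{s' \sim P}\bigl[V(s')\bigr]
\;=\;
\sup_{\lambda \ge 0}
  \Bigl\{ -\lambda \log \mathbb{E}_{s' \sim P^{o}}\bigl[\exp\bigl(-V(s')/\lambda\bigr)\bigr]
          - \lambda \rho \Bigr\}
```

The right-hand side only requires expectations under the nominal model, which is why an accurate estimate of the nominal dynamics is enough to evaluate robust value functions.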

Abstract

Three major challenges in reinforcement learning are complex dynamical systems with large state spaces, costly data acquisition, and the deviation of real-world deployment dynamics from the training environment. To overcome these issues, we study distributionally robust Markov decision processes with continuous state spaces under the widely used Kullback-Leibler, chi-square, and total variation uncertainty sets. We propose a model-based approach that utilizes Gaussian Processes and the maximum variance reduction algorithm to efficiently learn multi-output nominal transition dynamics, leveraging access to a generative model (i.e., simulator). We further establish the statistical sample complexity of the proposed method for different uncertainty sets. These complexity bounds are independent of the number of states and extend beyond linear dynamics, ensuring the effectiveness of our approach in identifying near-optimal distributionally robust policies. The proposed method can be further combined with other model-free distributionally robust reinforcement learning methods to obtain a near-optimal robust policy. Experimental results demonstrate the robustness of our algorithm to distributional shifts and its superior performance in terms of the number of samples needed.
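The sampling idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an RBF kernel with unit prior variance, independent GP outputs sharing one kernel, a finite candidate set of state-action pairs, and a hypothetical `toy_simulator` standing in for the generative model. All function names and hyperparameters here are invented for the example.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def posterior(X_train, Y_train, X_query, noise=1e-2, lengthscale=1.0):
    """GP posterior mean (one column per output) and the shared posterior variance."""
    K = rbf_kernel(X_train, X_train, lengthscale) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_query, X_train, lengthscale)
    mean = K_star @ np.linalg.solve(K, Y_train)
    v = np.linalg.solve(K, K_star.T)
    var = 1.0 - np.sum(K_star * v.T, axis=1)  # prior variance k(x, x) = 1 for RBF
    return mean, np.maximum(var, 0.0)

def mvr_learn_dynamics(simulator, candidates, n_samples, noise=1e-2, lengthscale=1.0):
    """Query the simulator at the candidate with the largest posterior variance."""
    # With no data the prior variance is uniform, so start at the first candidate.
    X = candidates[:1]
    Y = np.atleast_2d(simulator(X[0]))
    for _ in range(n_samples - 1):
        _, var = posterior(X, Y, candidates, noise, lengthscale)
        x_next = candidates[np.argmax(var)]
        X = np.vstack([X, x_next])
        Y = np.vstack([Y, simulator(x_next)])
    return X, Y

def toy_simulator(x):
    """Hypothetical nominal dynamics: 2-D state, 1-D action, returns the next state."""
    s0, s1, a = x
    return np.array([s0 + 0.1 * (np.sin(s0) + a), s1 + 0.1 * (np.cos(s1) - a)])

# Candidate state-action pairs on a coarse grid over [-1, 1]^3.
grid = np.linspace(-1.0, 1.0, 5)
candidates = np.array(np.meshgrid(grid, grid, grid)).reshape(3, -1).T

X, Y = mvr_learn_dynamics(toy_simulator, candidates, n_samples=30)
mean, var = posterior(X, Y, candidates)  # `mean` plays the role of the dynamics estimate
```

Because the GP posterior variance depends only on the queried inputs and not on the observed next states, the query locations in this sketch could be chosen before any simulator call is made.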

Paper Structure (21 sections, 30 theorems, 158 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Main Contributions
Problem Setting
Sampling Algorithm
Sample Complexity
Experiments
Environments:
Training:
Evaluation:
Conclusions
Acknowledgements
Theoretical Guarantees of Maximum Variance Reduction (MVR)
Gaussian Process Model
Non-adaptive Multi-output Confidence Bounds
...and 6 more sections

Key Result

Lemma 1

For $\beta_{n}(\delta)$ set as in lemma:confidence_single_output_discretized and $\mathcal{I}_{d}$ denoting $\{1,2,\cdots,d\}$, the MVR algorithm (alg: mgpbo) outputs the dynamics estimate $\hat{f}_n(\cdot,\cdot) = \mu_{n}(\cdot, \cdot)$ such that the following holds uniformly for all $(s,a)\in$ …

Figures (4)

Figure 1: Average performance (over 20 episodes) on the considered environments, as a function of different perturbations: length perturbation for Pendulum, force magnitude perturbation for Cartpole, and perturbed joint stiffness for Reacher. Unlike our MVR+RFQI and non-robust MVR+FQI, the other baselines are model-free and require access to the true nominal environment for training. The proposed approach MVR+RFQI achieves performance comparable to the model-free RFQI while requiring significantly fewer environment interactions (see Table \ref{tab:num_samples}). Moreover, as the perturbation magnitude increases, MVR+RFQI outperforms the non-robust baselines.
Figure 2: Pendulum experiments.
Figure 3: Cartpole experiments.
Figure 4: Reacher experiments with 'Springref' parameter set to 50 (left) or 100 (right).

Theorems & Definitions (51)

Lemma 1
Theorem 1
Lemma 2
Proposition 1
Proposition 2
Lemma 3
Lemma 4
Lemma 5
proof
Lemma 5
...and 41 more