A Finite Sample Complexity Bound for Distributionally Robust Q-learning

Shengbo Wang; Nian Si; Jose Blanchet; Zhengyuan Zhou

TL;DR

This paper addresses distributional shift between training simulators and deployment environments in reinforcement learning by developing a model-free, distributionally robust (DR) $Q$-learning algorithm with a finite-sample complexity guarantee. It extends a prior multi-level Monte Carlo (MLMC) estimator of the DR Bellman operator so that the expected number of samples per iteration is constant, and proves unbiasedness and variance bounds within a stochastic-approximation framework. The main result is a finite-sample bound of $\tilde{O}(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_{\wedge}^{-6}\delta^{-4})$ for learning the robust $Q$-function to accuracy $\epsilon$; the bound is tight in $|S||A|$ and nearly tight in the effective horizon. The paper also reports empirical validation on hard MDPs and inventory control problems. The work delivers the first model-free finite-sample guarantee for distributionally robust RL and provides practical insights into step-size choices and estimator design for robust deployment. Overall, it advances robust RL by offering tractable, theory-backed guarantees and by demonstrating improved robustness in simulation studies, which supports safer transfer to real-world settings.

Abstract

We consider a reinforcement learning setting in which the deployment environment is different from the training environment. Applying a robust Markov decision process formulation, we extend the distributionally robust $Q$-learning framework studied in Liu et al. [2022]. Further, we improve the design and analysis of their multi-level Monte Carlo estimator. Assuming access to a simulator, we prove that the worst-case expected sample complexity of our algorithm to learn the optimal robust $Q$-function within an $\epsilon$ error in the sup norm is upper bounded by $\tilde{O}(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_{\wedge}^{-6}\delta^{-4})$, where $\gamma$ is the discount rate, $p_{\wedge}$ is the non-zero minimal support probability of the transition kernels, and $\delta$ is the uncertainty size. This is the first sample complexity result for the model-free robust RL problem. Simulation studies further validate our theoretical results.
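To make the robust $Q$-function concrete, the following is a minimal sketch of a distributionally robust Bellman update on a tiny tabular MDP. It rests on several assumptions that are not stated in the abstract: the uncertainty set is a KL-divergence ball of radius $\delta$ around each nominal transition distribution, rewards are deterministic, and the nominal kernel is known. It is therefore a model-based robust value iteration, not the paper's model-free MLMC $Q$-learning algorithm. The function names, the example kernel `P`, and the rewards `R` are hypothetical.

```python
import math

def kl_robust_expectation(values, probs, delta, iters=100):
    """Worst-case mean of `values` over {p : KL(p || probs) <= delta}.

    Uses the dual form sup_{alpha >= 0} -alpha*log E[exp(-V/alpha)] - alpha*delta,
    which is concave in alpha, so a ternary search is sufficient.
    """
    support = [(v, p) for v, p in zip(values, probs) if p > 0.0]
    v_min = min(v for v, _ in support)
    v_max = max(v for v, _ in support)

    def dual(alpha):
        if alpha <= 0.0:
            return v_min  # limit of the dual objective as alpha -> 0+
        # Shift by v_min so every exponent is <= 0 (numerical stability).
        s = sum(p * math.exp(-(v - v_min) / alpha) for v, p in support)
        return v_min - alpha * math.log(s) - alpha * delta

    # The maximizer lies in [0, (v_max - v_min) / delta] because
    # dual(alpha) <= v_max - alpha * delta and dual(0) = v_min.
    lo, hi = 0.0, (v_max - v_min) / delta
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    return max(dual(0.0), dual(0.5 * (lo + hi)))

def robust_q_iteration(P, R, gamma, delta, sweeps=150):
    """Iterate Q(s,a) <- R[s][a] + gamma * inf_p E_p[max_b Q(s',b)]."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * kl_robust_expectation(V, P[s][a], delta)
              for a in range(n_a)] for s in range(n_s)]
    return Q

# Hypothetical 2-state, 2-action MDP: P[s][a] is a distribution over next states.
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.2, 0.8], [0.6, 0.4]]]
R = [[1.0, 0.0],
     [0.0, 0.5]]
Q_robust = robust_q_iteration(P, R, gamma=0.9, delta=0.1)
print(Q_robust)
```

A larger `delta` enlarges the uncertainty set, so the resulting robust $Q$-values can only decrease. The paper's algorithm replaces the exact inner expectation used here with an unbiased sample-based estimator.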

Paper Structure (37 sections, 14 theorems, 154 equations, 5 figures, 1 algorithm)

Introduction
Our Contributions
Related Work
Distributionally Robust Policy Learning Paradigm
Standard Policy Learning
Distributionally Robust Formulation
Strong Duality
Distributionally Robust $Q$-function and its Bellman Equation
$Q$-Learning in Distributionally Robust RL
A Review of Synchronized $Q$-Learning and Stochastic Approximations
Distributionally Robust $Q$-learning
Algorithm Complexity
Numerical Experiments
Hard MDPs for $Q$-learning
Lost-sale Inventory Control
...and 22 more sections

Key Result

Lemma 2.1

Suppose $H(X)$ has finite moment generating function in the neighborhood of zero. Then for any $\delta >0$,
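The displayed equation of the lemma is not reproduced above. Assuming the lemma is the standard KL-divergence duality that the theorem list attributes to it (Theorem 1 of the cited Hu 2012 reference), it would read as follows. The symbols $P_0$ for the nominal distribution and $D_{\mathrm{KL}}$ for the KL divergence are notation assumed here.

```latex
\sup_{P \,:\, D_{\mathrm{KL}}(P \,\|\, P_0) \le \delta} E_{P}[H(X)]
  \;=\; \inf_{\alpha \ge 0}
  \left\{ \alpha \log E_{P_0}\!\left[ e^{H(X)/\alpha} \right] + \alpha\delta \right\}.

% Applying the identity to -H gives the worst-case (minimization) form
% that a robust Bellman operator would use:
\inf_{P \,:\, D_{\mathrm{KL}}(P \,\|\, P_0) \le \delta} E_{P}[H(X)]
  \;=\; \sup_{\alpha \ge 0}
  \left\{ -\alpha \log E_{P_0}\!\left[ e^{-H(X)/\alpha} \right] - \alpha\delta \right\}.
```

Under this reading, the infinite-dimensional optimization over distributions reduces to a one-dimensional problem in the dual variable $\alpha$.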

Figures (5)

Figure 1: Hard MDP instances transition diagram.
Figure 2: Convergence of Algorithm 1 on the hard MDP instances of Figure 1.
Figure 3: Log averaged error against $\log(1-\gamma)$; the slopes of the regression lines at iterations $k = 500, 1000, 1500$ are $-2.031, -2.007, -2.021$.
Figure 4: Algorithm comparison: inventory model.
Figure 5: Test convergence for different $\delta$
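The slopes quoted in the Figure 3 caption come from a regression of log error on $\log(1-\gamma)$. As a minimal sketch of how such a slope is read off, the snippet below fits an ordinary least-squares line to synthetic data. The errors are generated from an assumed law $\text{err} = C(1-\gamma)^{-2}$ and are not the paper's measurements; `ols_slope`, `gammas`, and the constant `0.3` are hypothetical.

```python
import math

def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic data only: errors follow C * (1 - gamma)^(-2) by construction.
gammas = [0.5, 0.7, 0.8, 0.9, 0.95]
errors = [0.3 * (1.0 - g) ** -2 for g in gammas]

log_x = [math.log(1.0 - g) for g in gammas]
log_y = [math.log(e) for e in errors]
slope = ols_slope(log_x, log_y)
print(slope)
```

On a log-log plot, a slope near $-2$ corresponds to an error that grows roughly like $(1-\gamma)^{-2}$ at a fixed iteration count.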

Theorems & Definitions (36)

Definition 1
Remark
Lemma 2.1 (Hu and Hong [2012], Theorem 1)
Definition 2
Definition 3
Proposition 3.1
Definition 4
Remark
Definition 5
Remark
...and 26 more
