Sample Complexity of Variance-reduced Distributionally Robust Q-learning

Shengbo Wang; Nian Si; Jose Blanchet; Zhengyuan Zhou

TL;DR

The paper addresses robust reinforcement learning under distributional shifts by formulating KL (and later χ²) divergence ambiguity sets around the nominal environment dynamics and rewards. It proposes two model-free algorithms, DR Q-learning and its variance-reduced (VR) variant, and proves near-optimal, δ-independent minimax sample complexity bounds, with VR DR Q-learning achieving the best known dependence on the horizon and on ε. The bounds do not blow up as δ→0, which connects robust and non-robust RL and enables efficient learning when the adversary's power is small. Empirical results on hard MDPs corroborate the theory: variance reduction yields faster convergence and robustness to distributional shifts in practice, and the χ² extension broadens applicability to alternative ambiguity sets.

Abstract

Dynamic decision-making under distributional shifts is of fundamental interest in the theory and applications of reinforcement learning: the distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the $q$-function of an infinite-horizon $\gamma$-discounted robust Markov decision process with a Kullback-Leibler ambiguity set to an entry-wise $\epsilon$-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of $\tilde O(|\mathbf{S}||\mathbf{A}|(1-\gamma)^{-4}\epsilon^{-2})$, where $\mathbf{S}$ and $\mathbf{A}$ denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size $\delta$, thereby providing new complexity-theoretic insights. Additionally, a series of numerical experiments confirms the theoretical findings and the efficiency of the algorithms in handling distributional shifts.
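To make the Kullback-Leibler ambiguity set concrete, the following is a minimal sketch of how a worst-case expectation over a KL ball of radius $\delta$ can be evaluated through the standard one-dimensional dual, $\sup_{\alpha \ge 0}\{-\alpha \log E_{\mu_0}[e^{-X/\alpha}] - \alpha\delta\}$. It assumes a known finite nominal distribution, so it illustrates the robust Bellman target only and is not the paper's model-free algorithm. The function name `kl_worst_case_mean`, the search bound `alpha_max`, and the ternary search are illustrative choices, not taken from the paper.

```python
import math

def kl_worst_case_mean(values, probs, delta, alpha_max=1e3, iters=200):
    """Approximate inf of E_mu[X] over {mu : KL(mu || mu0) <= delta}.

    Uses the dual sup_{alpha >= 0} { -alpha*log E_mu0[exp(-X/alpha)] - alpha*delta }.
    `alpha_max` is an assumed upper bound on the maximizer (hypothetical default).
    """
    support = [(v, p) for v, p in zip(values, probs) if p > 0]
    x_min = min(v for v, _ in support)

    def dual(alpha):
        if alpha <= 0.0:
            return x_min  # limit of the dual objective as alpha -> 0
        # Shift by x_min so every exponent is <= 0 (numerically stable).
        s = sum(p * math.exp(-(v - x_min) / alpha) for v, p in support)
        return x_min - alpha * math.log(s) - alpha * delta

    # The dual objective is concave in alpha, so ternary search applies.
    lo, hi = 0.0, alpha_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    return max(dual(0.5 * (lo + hi)), x_min)

# Example: one robust Bellman target r + gamma * (worst-case next value).
gamma, reward = 0.9, 1.0
next_values = [0.0, 1.0]    # hypothetical next-state values
nominal = [0.5, 0.5]        # hypothetical nominal transition probabilities
target = reward + gamma * kl_worst_case_mean(next_values, nominal, delta=0.1)
```

As $\delta$ shrinks the returned value approaches the nominal mean, and for large $\delta$ it approaches the smallest value in the support, which matches the intuition that the adversary's power grows with $\delta$.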

Paper Structure (55 sections, 40 theorems, 299 equations, 4 figures, 2 tables, 2 algorithms)

Introduction
Our Motivation
Our Contributions
Literature Review
Distributionally Robust Reinforcement Learning
Classical Tabular Reinforcement Learning
Kullback-Leibler Divergence Constrained DR-RL
Dual and $q$-Function Formulations
Synchronous Q-Learning and Stochastic Approximation
The DR Q-Learning and Variance Reduction
The Distributionally Robust Q-learning
The Variance-Reduced Distributionally Robust Q-learning
Overview of the Analysis of Algorithms
Numerical Experiments
Hard MDPs for the Q-learning
...and 40 more sections

Key Result

Lemma 1

Let $X$ be a random variable and $\mu_0$ be a probability measure on $(\Omega,\mathcal{F})$ s.t. $X$ has a finite moment generating function in a neighborhood of zero. Then for any $\delta >0$,
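
For reference, the KL duality result to which the lemma is attributed (Hu2012KLDRO, Theorem 1) is usually stated as the identity below. This is a sketch under the assumption that the lemma follows that standard statement; the notation $D_{\mathrm{KL}}$ and the dual variable $\alpha$ are assumptions and may differ from the paper's.

```latex
\sup_{\mu \,:\, D_{\mathrm{KL}}(\mu \,\|\, \mu_0) \le \delta} E_{\mu}[X]
  \;=\; \inf_{\alpha \ge 0}
  \left\{ \alpha \log E_{\mu_0}\!\left[ e^{X/\alpha} \right] + \alpha \delta \right\}.
```

Applying the identity to $-X$ gives the worst-case (infimum) form $\inf_{\mu} E_{\mu}[X] = \sup_{\alpha \ge 0} \{ -\alpha \log E_{\mu_0}[e^{-X/\alpha}] - \alpha\delta \}$, which is the form relevant to a robust Bellman operator.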

Figures (4)

Figure 1: Hard MDP for the Q-learning transition diagram.
Figure 2: Convergence of the DR Q-learning and variance-reduced DR Q-learning algorithms on the hard MDP instance (Figure 1).
Figure 3: Comparing the performance of the DR Q-learning algorithm, the variance-reduced DR Q-learning algorithm, and the MLMC DR Q-learning on the hard MDP instance (Figure 1).
Figure 4: Testing the sample complexity behavior as $\delta\downarrow 0$.

Theorems & Definitions (79)

Definition 1
Lemma 1: Hu2012KLDRO, Theorem 1
Definition 2
Definition 3
Definition 4
Definition 5
Proposition 3.1
Corollary 1
Theorem 1
proof
...and 69 more
