Single-Trajectory Distributionally Robust Reinforcement Learning

Zhipeng Liang; Xiaoteng Ma; Jose Blanchet; Jiheng Zhang; Zhengyuan Zhou

TL;DR

This paper designs the first fully model-free distributionally robust RL (DRRL) algorithm, called distributionally robust Q-learning with single trajectory (DRQ). Its carefully designed multi-timescale framework uses each incrementally arriving sample to learn the optimal distributionally robust policy directly, without modelling the environment.

Abstract

To mitigate the limitation that the classical reinforcement learning (RL) framework relies heavily on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As the price of this robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design the first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We carefully design a multi-timescale framework that fully utilizes each incrementally arriving sample and directly learns the optimal distributionally robust policy without modelling the environment, so the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm compared to non-robust methods and other robust RL algorithms.

Paper Structure (25 sections, 9 theorems, 72 equations, 8 figures, 1 table, 2 algorithms)

Introduction
Our Contributions
Related Work
Preliminary
Discounted MDPs
$Q$-learning
Distributionally Robust MDPs
Distributionally Robust $Q$-learning with Single Trajectory
Divergence Families
Bias in Plug-in Estimator in Single Trajectory Setting
Three-timescale Framework
Algorithmic Design
Experiments
Convergence and Sample Complexity
Practical Implementation
...and 10 more sections

Key Result

Lemma 3.1

For any random variable $X\sim P$, define $\sigma_k(X, \eta) = -c_k(\rho) \mathbb{E}_P[(\eta - X)_+^{k_*}]^{\frac{1}{k_*}} + \eta$ with $k_* = \frac{k}{k-1}$ and $c_k(\rho) = (1+k(k-1)\rho)^{\frac{1}{k}}$. Then
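The definition of $\sigma_k(X, \eta)$ above can be evaluated numerically. The following is a minimal sketch, not the paper's implementation: it assumes $k > 1$, replaces $\mathbb{E}_P$ with an empirical average over a list of samples, and assumes, as in the duality result cited for this lemma, that the quantity of interest is the supremum of $\sigma_k$ over $\eta$. The helper `approx_sup_sigma_k` and its grid range are hypothetical choices made only for illustration.

```python
def c_k(k, rho):
    """Constant c_k(rho) = (1 + k (k - 1) rho)^(1/k) from the lemma."""
    return (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)


def sigma_k(samples, eta, k, rho):
    """Evaluate sigma_k(X, eta), with E_P replaced by an empirical average.

    Assumes k > 1 so that k_* = k / (k - 1) is finite.
    """
    k_star = k / (k - 1.0)
    # Empirical estimate of E_P[(eta - X)_+^{k_*}].
    moment = sum(max(eta - x, 0.0) ** k_star for x in samples) / len(samples)
    return -c_k(k, rho) * moment ** (1.0 / k_star) + eta


def approx_sup_sigma_k(samples, k, rho, num_points=2001):
    """Crude grid search for sup over eta (hypothetical helper, not from the paper).

    For eta <= min(samples) the function equals eta, so the grid starts there;
    the upper end of the grid is an arbitrary illustrative choice.
    """
    lo, hi = min(samples), max(samples)
    span = (hi - lo) + 1.0
    etas = [lo + 10.0 * span * i / (num_points - 1) for i in range(num_points)]
    return max(sigma_k(samples, eta, k, rho) for eta in etas)


# Usage: a larger radius rho can only lower the value, since c_k grows with rho.
returns = [0.0, 1.0, 2.0, 3.0]
for rho in (0.0, 0.1, 1.0):
    print(rho, approx_sup_sigma_k(returns, k=2, rho=rho))
```

Because $c_k(\rho) \ge 1$ and the $k_*$-norm dominates the mean, $\sigma_k(X, \eta)$ never exceeds $\mathbb{E}_P[X]$, so the grid search returns a value between the smallest sample and the sample mean.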

Figures (8)

Figure 1: The Cliffwalking environment and the learned policies for different $\rho$'s.
Figure 2: Average return and number of steps over 100 random seeds in the perturbed environments. $\rho=0$ corresponds to non-robust $Q$-learning. $R$ denotes the $R$-contamination ambiguity set.
Figure 3: The training curves in the Cliffwalking environment. Each curve is averaged over 100 random seeds and shaded by their standard deviations. The dashed line is the optimal robust value with corresponding $k$ and $\rho$.
Figure 4: Sample complexity comparisons in the Cliffwalking environment with Liu's and model-based algorithms. Each curve is averaged over 100 random seeds and shaded by their standard deviations.
Figure 5: The return in the CartPole and LunarLander environments. Each curve is averaged over 100 random seeds and shaded by their standard deviations. AP: Action Perturbation; FMP: Force Mag Perturbation; EPP: Engines Power Perturbation.
...and 3 more figures

Theorems & Definitions (13)

Lemma 3.1 (Duchi and Namkoong, 2021)
Lemma 3.2: Sub-Gradient of the $\sigma_k$ dual function
Theorem 3.3
Lemma 2.2: Discrete Gronwall inequality
Lemma 2.3: Gronwall inequality
Lemma 2.8
Lemma 2.17
proof
Theorem 2.18
proof
...and 3 more
