Single-Trajectory Distributionally Robust Reinforcement Learning

Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jiheng Zhang, Zhengyuan Zhou

TL;DR

This paper designs a first fully model-free distributionally robust RL (DRRL) algorithm, called distributionally robust Q-learning with single trajectory (DRQ). It relies on a carefully designed multi-timescale framework that fully utilizes each incrementally arriving sample and directly learns the optimal distributionally robust policy without modelling the environment.

Abstract

To mitigate the limitation that the classical reinforcement learning (RL) framework relies heavily on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As the price of this robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We carefully design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modelling the environment, so that the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm compared to non-robust methods and other robust RL algorithms.

Paper Structure

This paper contains 25 sections, 9 theorems, 72 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Lemma 3.1

For any random variable $X\sim P$, define $\sigma_k(X, \eta) = -c_k(\rho) \mathbb{E}_P[(\eta - X)_+^{k_*}]^{\frac{1}{k_*}} + \eta$ with $k_* = \frac{k}{k-1}$ and $c_k(\rho) = (1+k(k-1)\rho)^{\frac{1}{k}}$. Then
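
As a rough illustration of the definition only, the dual function $\sigma_k(X, \eta)$ can be evaluated directly when $X$ has finite support. The sketch below assumes a discrete distribution given as parallel lists of values and probabilities; the function name `sigma_k` and its arguments are hypothetical and not taken from the paper.

```python
def sigma_k(values, probs, eta, k=2.0, rho=0.1):
    """Evaluate sigma_k(X, eta) for a finite-support random variable X.

    Assumption: X takes values[i] with probability probs[i].
    Follows the definition quoted in Lemma 3.1:
        sigma_k = -c_k(rho) * E[(eta - X)_+^{k_*}]^{1/k_*} + eta
    """
    if k <= 1.0:
        raise ValueError("k must be greater than 1 so that k_* = k/(k-1) is finite")
    k_star = k / (k - 1.0)                           # conjugate exponent k_*
    c_k = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)   # c_k(rho)
    # E_P[(eta - X)_+^{k_*}] as a probability-weighted sum
    moment = sum(p * max(eta - x, 0.0) ** k_star for x, p in zip(values, probs))
    return -c_k * moment ** (1.0 / k_star) + eta


# Example call: X uniform on {1, 2}, eta = 1.5, k = 2, rho = 0.1
example = sigma_k([1.0, 2.0], [0.5, 0.5], eta=1.5)
```

With $\rho = 0$ the constant $c_k(\rho)$ equals 1, and for any $\eta$ below the smallest value of $X$ the positive part vanishes, so the function returns $\eta$ itself.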

Figures (8)

  • Figure 1: The Cliffwalking environment and the learned policies for different $\rho$'s.
  • Figure 2: Return and steps in the perturbed environments, averaged over 100 random seeds. $\rho=0$ corresponds to the non-robust $Q$-learning. $R$ denotes the $R$-contamination ambiguity set.
  • Figure 3: The training curves in the Cliffwalking environment. Each curve is averaged over 100 random seeds and shaded by their standard deviations. The dashed line is the optimal robust value with corresponding $k$ and $\rho$.
  • Figure 4: Sample complexity comparisons in the Cliffwalking environment against Liu's and model-based algorithms. Each curve is averaged over 100 random seeds and shaded by their standard deviations.
  • Figure 5: The return in the CartPole and LunarLander environments. Each curve is averaged over 100 random seeds and shaded by their standard deviations. AP: Action Perturbation; FMP: Force Mag Perturbation; EPP: Engines Power Perturbation.
  • ...and 3 more figures

Theorems & Definitions (13)

  • Lemma 3.1 (from duchi2021learning)
  • Lemma 3.2: Sub-Gradient of the $\sigma_k$ dual function
  • Theorem 3.3
  • Lemma 2.2: Discrete Gronwall inequality
  • Lemma 2.3: Gronwall inequality
  • Lemma 2.8
  • Lemma 2.17
  • proof
  • Theorem 2.18
  • proof
  • ...and 3 more