Robust Q-Learning under Corrupted Rewards

Sreejeet Maity; Aritra Mitra

TL;DR

This work investigates the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, and develops a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step.

Abstract

Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step. Finally, we prove a finite-time convergence rate for our algorithm that matches known state-of-the-art bounds (in the absence of attacks) up to a small inevitable $O(\varepsilon)$ error term that scales with the adversarial corruption fraction $\varepsilon$. Notably, our results continue to hold even when the true reward distributions have infinite support, provided they admit bounded second moments.
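To make the idea of building a robust empirical Bellman operator from historical rewards concrete, here is a minimal sketch. It is not the paper's algorithm: the robust estimator (a trimmed mean), the toy 2-state MDP with uniform transitions, the corruption value of 1000, and all names (`trimmed_mean`, `observe_reward`, `run`) are assumptions chosen for illustration. Clean rewards are Gaussian, so they have infinite support but bounded second moments, and each observed reward is replaced by the adversary's value with probability `EPS`.

```python
import random

GAMMA, EPS, T = 0.5, 0.1, 1000       # discount, corruption fraction, iterations
MEAN_R = [[1.0, 0.0], [0.0, 2.0]]    # hypothetical true mean rewards R(s, a)
N_S, N_A = 2, 2

def trimmed_mean(samples, eps):
    """Average after dropping the k smallest and k largest samples."""
    n = len(samples)
    # Illustrative trimming level; always leaves at least one sample.
    k = min(int(2 * eps * n) + 1, (n - 1) // 2)
    kept = sorted(samples)[k:n - k]
    return sum(kept) / len(kept)

def observe_reward(rng, s, a):
    """Contaminated reward: with probability EPS the adversary returns 1000."""
    if rng.random() < EPS:
        return 1000.0
    return rng.gauss(MEAN_R[s][a], 1.0)

def optimal_q(iters=200):
    """Value iteration on the clean MDP (next state uniform over states)."""
    q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        v = sum(max(row) for row in q) / N_S
        q = [[MEAN_R[s][a] + GAMMA * v for a in range(N_A)] for s in range(N_S)]
    return q

def run(robust, seed=0):
    """Synchronous Q-learning: every (s, a) pair is updated at every step."""
    rng = random.Random(seed)
    q = [[0.0] * N_A for _ in range(N_S)]
    history = [[[] for _ in range(N_A)] for _ in range(N_S)]
    for t in range(1, T + 1):
        alpha = 1.0 / (1.0 + (1.0 - GAMMA) * t)   # in (0, 1), decaying like 1/t
        new_q = [row[:] for row in q]
        for s in range(N_S):
            for a in range(N_A):
                r = observe_reward(rng, s, a)
                history[s][a].append(r)
                # Robust variant: replace the raw reward by a robust
                # estimate computed from all rewards seen so far for (s, a).
                r_hat = trimmed_mean(history[s][a], EPS) if robust else r
                s_next = rng.randrange(N_S)
                target = r_hat + GAMMA * max(q[s_next])
                new_q[s][a] = (1 - alpha) * q[s][a] + alpha * target
        q = new_q
    return q

def sup_error(q, q_star):
    return max(abs(q[s][a] - q_star[s][a])
               for s in range(N_S) for a in range(N_A))

q_star = optimal_q()
vanilla_err = sup_error(run(robust=False), q_star)
robust_err = sup_error(run(robust=True), q_star)
print(f"vanilla error: {vanilla_err:.2f}, robust error: {robust_err:.2f}")
```

In this toy setting the vanilla iterate absorbs the inflated mean of the corrupted rewards, while the trimmed-mean variant stays close to the clean fixed point. The small residual bias that one-sided corruption introduces into a symmetric trimmed mean is in the spirit of the $O(\varepsilon)$ term mentioned in the abstract, though this sketch makes no claim to match the paper's estimator or rates.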

Paper Structure (7 sections, 9 theorems, 38 equations, 1 figure, 2 algorithms)

Introduction
Background and Problem Formulation
Motivation
$\varepsilon$-Robust Q-Learning algorithm
Proof of the Main Result
Tackling Unbounded Reward Distributions
Conclusion

Key Result

Theorem 1

Consider the vanilla synchronous Q-learning algorithm in Eq. \eqref{eqn:syncQ} with rewards perturbed according to the Huber attack model described above. If the step-size sequence satisfies $\alpha_t \in (0,1)$ with $\sum_{t=1}^{\infty}\alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$, then with

Figures (1)

Figure 1: An MDP with state space $\mathcal{S} =\{1, 2, 3, 4, 5\}$ and action space $\mathcal{A} = \{\texttt{L}, \texttt{R}\}$. In state $s=1$, taking action $a = \texttt{L}$ leads to a transition to state 2 with probability $p$, and to state 5 with probability $1-p$. Taking action $a = \texttt{R}$ in state $s=1$ leads to symmetric outcomes. In states 2 and 3, regardless of the chosen action, the system remains in its current state with probability $p$, and transitions to states 4 and 5, respectively, with probability $1-p$. States 4 and 5 are absorbing: once reached, the system remains there indefinitely.
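The transition structure in the caption can be written out as a small table. The sketch below is an illustration, not code from the paper: the function name `build_transitions` is hypothetical, and the "symmetric outcomes" for action `R` in state 1 are interpreted here as the mirror image of action `L` (state 3 with probability $p$, state 4 with probability $1-p$), which is an assumption. Rewards are not specified in the caption and are left out.

```python
def build_transitions(p):
    """Return P[(s, a)][s_next] for the 5-state, 2-action MDP of Figure 1."""
    states = [1, 2, 3, 4, 5]
    actions = ["L", "R"]
    P = {(s, a): {s2: 0.0 for s2 in states} for s in states for a in actions}

    # State 1, action L: go to state 2 w.p. p, to state 5 w.p. 1 - p.
    P[(1, "L")][2] = p
    P[(1, "L")][5] = 1 - p
    # State 1, action R: ASSUMED mirror image of L (2 <-> 3, 5 <-> 4).
    P[(1, "R")][3] = p
    P[(1, "R")][4] = 1 - p

    for a in actions:
        # States 2 and 3: stay put w.p. p, else move to 4 and 5 respectively.
        P[(2, a)][2] = p
        P[(2, a)][4] = 1 - p
        P[(3, a)][3] = p
        P[(3, a)][5] = 1 - p
        # States 4 and 5 are absorbing.
        P[(4, a)][4] = 1.0
        P[(5, a)][5] = 1.0
    return P

P = build_transitions(0.7)
# Every row of the transition kernel should be a probability distribution.
row_sums = {key: sum(row.values()) for key, row in P.items()}
```

Checking that every entry of `row_sums` equals 1 is a quick sanity test that the kernel is well-formed for any $p \in [0, 1]$.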

Theorems & Definitions (16)

Theorem 1
proof
Theorem 2
proof
Theorem 3
Theorem 4
Remark 1
Lemma 1
Lemma 2
proof
...and 6 more
