Q-Learning with Shift-Aware Upper Confidence Bound in Non-Stationary Reinforcement Learning

Ha Manh Bui, Felix Parker, Kimia Ghobadi, Anqi Liu

TL;DR

Density-QUCB (DQUCB) is a shift-aware Q-learning UCB algorithm that uses a transition density function to detect distribution shifts, then uses the resulting likelihood to improve the uncertainty estimates of Q-learning UCB, yielding a better balance between exploration and exploitation.

Abstract

We study non-stationary Reinforcement Learning (RL) under distribution shifts in both finite-horizon episodic and infinite-horizon discounted Markov Decision Processes (MDPs). In the finite-horizon case, the transition functions may change suddenly at a particular episode. In the infinite-horizon setting, such changes can occur at an arbitrary time step during the agent's interaction with the environment. While the Q-learning Upper Confidence Bound algorithm (QUCB) can discover a proper policy during learning, under distribution shifts this policy may keep exploiting sub-optimal rewards after the shift happens. To address this issue, we propose Density-QUCB (DQUCB), a shift-aware Q-learning UCB algorithm that uses a transition density function to detect distribution shifts, then leverages its likelihood to improve the uncertainty estimation quality of Q-learning UCB, resulting in a better balance between exploration and exploitation. Theoretically, we prove that our oracle DQUCB achieves a better regret guarantee than QUCB. Empirically, DQUCB retains the computational efficiency of model-free RL and achieves lower regret than QUCB baselines across RL tasks, as well as on a real-world COVID-19 patient hospital allocation task using a Deep-Q-learning architecture.
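
To make the role of the transition density concrete, the following is a minimal sketch of how a likelihood $p(s'|s,a)$ could flag a shift in a small tabular problem. It assumes a Laplace-smoothed count estimator, which is only an illustrative stand-in for the paper's density $p(\cdot|s,a;\theta)$; the class name `CountDensity` and its parameters are hypothetical.

```python
import numpy as np


class CountDensity:
    """Laplace-smoothed count estimate of p(s' | s, a).

    Hypothetical stand-in for a learned transition density; the paper's
    estimator p(.|s,a;theta) may be parametrized differently.
    """

    def __init__(self, n_states, n_actions, smoothing=1.0):
        # Every (s, a, s') cell starts at `smoothing`, so no likelihood is zero.
        self.counts = np.full((n_states, n_actions, n_states), smoothing)

    def update(self, s, a, s_next):
        # Record one observed transition (s, a) -> s_next.
        self.counts[s, a, s_next] += 1.0

    def likelihood(self, s, a, s_next):
        # Normalized count: estimated probability of s_next given (s, a).
        row = self.counts[s, a]
        return row[s_next] / row.sum()


density = CountDensity(n_states=3, n_actions=1)
for _ in range(97):
    density.update(0, 0, 1)  # before the shift, (s=0, a=0) always leads to s'=1

familiar = density.likelihood(0, 0, 1)    # transition seen many times
surprising = density.likelihood(0, 0, 2)  # transition never seen so far
print(familiar, surprising)
```

A transition that the estimator has seen often receives a likelihood near 1, while an unseen transition receives a likelihood near 0; a run of low likelihoods is the kind of signal a shift-aware bonus could react to.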

Paper Structure

This paper contains 31 sections, 5 theorems, 64 equations, 12 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.1

There exists an absolute constant $c>0$ such that for any $\delta \in (0,1)$, if we choose $b_t=\frac{c}{p(\cdot|s,a;\theta_h)}\sqrt{H^3\iota/t}$, where $t=N_h(s,a)$, then with probability $1-\delta$, the regret of the episodic algorithm (alg:episodic) satisfies [regret bound not preserved in this listing], where $\epsilon$ is the estimation error of $1/p(\cdot|s,a;\theta_h)$.
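
The bonus in Theorem 4.1 can be evaluated directly. The sketch below computes $b_t=\frac{c}{p}\sqrt{H^3\iota/t}$ next to the same expression without the $1/p$ factor, for comparison. The function names, the default `c=1.0`, and the `floor` guard against division by a near-zero likelihood are assumptions for illustration, not part of the theorem.

```python
import math


def density_scaled_bonus(likelihood, t, H, iota, c=1.0, floor=1e-6):
    """b_t = c / p * sqrt(H^3 * iota / t), with t the visit count N_h(s, a).

    `floor` is an assumed numerical guard, not taken from the paper.
    """
    if t < 1:
        raise ValueError("visit count t must be at least 1")
    return c / max(likelihood, floor) * math.sqrt(H**3 * iota / t)


def unscaled_bonus(t, H, iota, c=1.0):
    """The same expression without the 1/p factor, for comparison."""
    if t < 1:
        raise ValueError("visit count t must be at least 1")
    return c * math.sqrt(H**3 * iota / t)


# Same visit count, different likelihoods under the transition density.
b_high_likelihood = density_scaled_bonus(likelihood=0.9, t=4, H=2, iota=2.0)
b_low_likelihood = density_scaled_bonus(likelihood=0.5, t=4, H=2, iota=2.0)
b_plain = unscaled_bonus(t=4, H=2, iota=2.0)
print(b_high_likelihood, b_low_likelihood, b_plain)
```

With the count $t$ held fixed, a lower likelihood gives a larger bonus (more exploration), and a likelihood of 1 recovers the unscaled expression.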

Figures (12)

  • Figure 1: Robot navigation example, where $\mathbb{P}^k$ denotes the set of slippery distributions $\{\mathbb{P}_h\}_{h\in [H]}$ along the route at the $k$-th episode. We consider a more general setting than prior works: whereas prior works keep $\{\mathbb{P}_h\}_{h\in [H]}$ fixed over all episodes $k \in [K]$, we allow $\{\mathbb{P}_h\}_{h\in [H]}$ to change across episodes. Similarly, in the infinite-horizon discounted MDP, prior works consider a fixed $\mathbb{P}$ across time steps, while we allow $\mathbb{P}$ to change.
  • Figure 2: Our shift-aware Q-learning UCB can be more uncertain (i.e., more exploration) if the environment changes because of a low likelihood from $p(\cdot|s,a;\theta)$, and more certain (i.e., more exploitation) if the environment stays the same because of a high likelihood from $p(\cdot|s,a;\theta)$.
  • Figure 3: (a) Cumulative regret on GridWorld, with transition noise $\epsilon=0.01$ and $\epsilon=0.2$ before and after the shift at $\bar{K}=25000$; (b) Cumulative regret on Frozen-Lake, with slippery level $\epsilon=0$, $\epsilon=1/2$, and $\epsilon=2/3$ before and after the shifts at $\bar{K}\in\{20000, 40000\}$; (c) Cumulative regret on CartPole, where transition noise $\mathcal{N}(0,0.15)$ is added to the velocity state after episode $\bar{K}=400$. Results are averaged over $10$ runs. Computational complexity is reported in Fig. 4.
  • Figure 4: Time and space complexity comparison (lower is better) in the $[10]\times[5]$ GridWorld environment.
  • Figure 5: Uncertainty quantification quality of UCB with calibration error measurement.
  • ...and 7 more figures

Theorems & Definitions (12)

  • Remark 3.1
  • Theorem 4.1
  • Remark 4.2
  • Theorem 4.3
  • Remark 4.4
  • Lemma A.1
  • Lemma A.2
  • Lemma A.3
  • proof
  • proof
  • ...and 2 more