Beyond Optimism: Exploration With Partially Observable Rewards

Simone Parisi, Alireza Kazemipour, Michael Bowling

TL;DR

This paper presents a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable, and proposes a collection of tabular environments for benchmarking exploration in RL.

Abstract

Exploration in reinforcement learning (RL) remains an open challenge. RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse the agent learns slowly or may not learn at all. To improve exploration and reward discovery, popular algorithms rely on optimism. But what if rewards are sometimes unobservable, e.g., in partial monitoring for bandits and in the recent formalism of monitored Markov decision processes? In this case, optimism can lead to suboptimal behavior that does not explore further to collapse uncertainty. With this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable. We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.

Paper Structure

This paper contains 20 sections, 2 theorems, 10 equations, 14 figures, 1 table, 5 algorithms.

Key Result

Theorem 1

If the goal-relative diameter of $\rho$ is bounded by $\bar{D}$, then Algorithm \ref{alg:explore_exploit} is a GLIE policy.
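For readers unfamiliar with the term, GLIE stands for "greedy in the limit with infinite exploration": every state-action pair is tried infinitely often, while the policy becomes greedy with respect to the value estimates in the limit. The sketch below is a generic illustration of that property, not the paper's algorithm. It uses a count-based epsilon-greedy rule with epsilon = c / n(s), a standard textbook example of a GLIE schedule for 0 < c < 1. The function names, the constant `c`, and the visit-count argument are all assumptions made for this example.

```python
import random


def exploration_rate(visit_count, c=0.5):
    """Hypothetical schedule: epsilon shrinks as the state is visited more."""
    return c / visit_count  # visit_count is assumed to be >= 1


def glie_epsilon_greedy(q_values, visit_count, c=0.5, rng=random):
    """Pick an action for one state under the decaying epsilon-greedy rule.

    q_values:    list of value estimates, one per action (illustrative).
    visit_count: number of times the current state has been visited.
    """
    if rng.random() < exploration_rate(visit_count, c):
        # Explore: uniform random action. Since epsilon decays only as 1/n,
        # exploration never stops entirely.
        return rng.randrange(len(q_values))
    # Exploit: greedy action. As epsilon goes to 0 this branch dominates.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Early on the rule explores often; after many visits it is almost always greedy, which is the behavior the GLIE property asks for.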

Figures (14)

  • Figure 1: When optimism is not enough. The agent starts in the second leftmost cell of a corridor-like gridworld and can move $\texttt{LEFT}$ or $\texttt{RIGHT}$. The coin in the leftmost cell gives a small reward, the one in the rightmost cell gives a large reward, and all other cells give zero reward. In \ref{fig:walk_unobs}, pushing the button is costly (negative reward) but is needed to observe coin rewards: if the agent collects a coin without pushing the button first, the environment returns the reward but the agent cannot observe it. Given a sufficiently large discount factor, the optimal policy is to collect the large coin without pushing the button. Consider a purely optimistic agent, i.e., an agent that selects actions greedily with respect to its value estimates, all of which are initialized to the same optimistically high value \cite{even2001convergence}. In \ref{fig:walk_obs}, all rewards are always observable, so the agent visits a cell, its optimistic estimate decreases according to the observed reward, and the agent tries another cell. Eventually, it visits all states and learns the optimal policy. In \ref{fig:walk_unobs}, however, when the agent visits a coin cell without pushing the button first, it does not observe the reward and can neither confirm nor refute its optimistic estimate. Thus, the estimates for both coin cells stay equally optimistic, and between the two the agent prefers the leftmost cell because it is closer to the start. Optimistic model-based algorithms \cite{auer2007logarithmic, jaksch2010near} have the same problem.
If all value estimates are optimistic and the agent knows that pushing the button is costly, then among (1) pushing the button and collecting the right coin, (2) not pushing the button and collecting the right coin, and (3) not pushing the button and collecting the left coin, the agent will always choose the third: the optimistic value is the same, but it avoids the button cost and the left coin is closer. Yet this gives the agent no information, because it cannot observe the reward and therefore cannot update its optimistic value. Note that the agent could replace the unobserved reward with an estimate from a model updated as rewards are observed. But how can this model be accurate if the agent cannot explore properly and observe rewards in the first place?
  • Figure 2: The MDP framework.
  • Figure 3: The Mon-MDP framework.
  • Figure 4: Environments. The goal is to collect the large coin ($r^\textsc{e}_t = 1$) instead of small "distracting" coins ($r^\textsc{e}_t = 0.1$). In Hazard, the agent must avoid quicksand (it prevents any movement) and toxic clouds ($r^\textsc{e}_t = -10$). In One-Way, the agent must walk over toxic clouds ($r^\textsc{e}_t = -0.1$) to get the large coin. In River Swim, the stochastic transition pushes the agent to the left. More details in Appendix \ref{app:envs}.
  • Figure 5: Episode return $\sum (r^\textsc{e}_t + r^\textsc{m}_t)$ of greedy policies averaged over 100 seeds (shaded areas denote 95% confidence intervals). Our exploration clearly outperforms all baselines, as it is the only one that learns in all Mon-MDPs. Indeed, while all baselines learn relatively quickly when rewards are fully observable (first column), their performance drops drastically when rewards are only partially observable.
  • ...and 9 more figures
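
The failure mode described in the Figure 1 caption can be reproduced in a few lines. The sketch below is a simplified illustration, not the paper's environment or code: it drops the button entirely and uses a boolean flag (`observe_coins`) to decide whether coin rewards are ever seen, it runs plain greedy Q-learning with optimistic initialization, and it breaks ties toward `LEFT` to stand in for "the left coin is closer". The corridor length, learning rate, discount factor, and initial value are all assumed values chosen for the example.

```python
import numpy as np

LEFT, RIGHT = 0, 1


def run_greedy_optimistic(observe_coins, n_cells=5, episodes=200,
                          gamma=0.9, alpha=0.5, q_init=2.0, max_steps=50):
    """Greedy Q-learning with optimistic initialization on a toy corridor.

    Cell 0 holds the small coin, cell n_cells - 1 the large coin; reaching
    either ends the episode. All names and constants here are assumptions.
    """
    small, large = 0.1, 1.0
    q = np.full((n_cells, 2), q_init)  # same optimistic value everywhere
    final_cells = []
    for _ in range(episodes):
        s = 1  # second leftmost cell, as in the caption
        for _ in range(max_steps):
            a = int(np.argmax(q[s]))  # purely greedy; ties go to LEFT
            s_next = s - 1 if a == LEFT else s + 1
            done = s_next in (0, n_cells - 1)
            if done:
                r = small if s_next == 0 else large
                if observe_coins:
                    q[s, a] += alpha * (r - q[s, a])
                # Otherwise the reward is unobserved: no update is possible,
                # so the estimate keeps its optimistic value.
            else:
                # Zero reward in empty cells is observed in both settings.
                target = gamma * q[s_next].max()
                q[s, a] += alpha * (target - q[s, a])
            s = s_next
            if done:
                break
        final_cells.append(s)
    return q, final_cells


if __name__ == "__main__":
    for flag in (True, False):
        _, ends = run_greedy_optimistic(observe_coins=flag)
        print(f"observe_coins={flag}: last episode ended in cell {ends[-1]}")
```

With observable coin rewards, the optimistic estimates are corrected by experience and the greedy agent settles on the large coin. With unobservable coin rewards, the estimate for stepping onto the left coin is never updated, so the agent returns to it in every episode and never learns anything from doing so.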

Theorems & Definitions (5)

  • Definition 1 (\cite{singh2000convergence})
  • Definition 2
  • Theorem 1
  • proof
  • Corollary 1