Satisficing Exploration for Deep Reinforcement Learning

Dilip Arumugam; Saurabh Kumar; Ramki Gummadi; Benjamin Van Roy

TL;DR

Large-scale RL makes exhaustive exploration to recover $Q^\star$ intractable. The paper proposes a model-free Bayesian RL framework that maintains a posterior over the optimal value function $Q^\star$ and uses rate-distortion theory, with distortion $d_s(Q^\star,\widetilde{a})=(\max_a Q^\star(s,a)-Q^\star(s,\widetilde{a}))^2$, to compute a per-timestep target action via Blahut-Arimoto. This BA-RVF method generalizes Randomized Value Functions to deep settings without planning. Empirical results on MiniGrid and RiverSwim show that it yields a spectrum of satisficing strategies and can outperform non-information-theoretic baselines while still enabling efficient recovery of optimal behavior when feasible.
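To make the per-timestep computation concrete, here is a minimal sketch of a Blahut-Arimoto iteration using the squared-regret distortion above. It is an illustration under stated assumptions, not the authors' implementation: it assumes the posterior over $Q^\star$ at the current state is represented by a finite set of equally weighted samples (for example, ensemble members), that a Lagrange multiplier `beta` trades off rate against distortion, and that the function name and arguments are hypothetical.

```python
import numpy as np


def blahut_arimoto_action_dist(q_samples, beta, num_iters=100):
    """Hypothetical helper: Blahut-Arimoto over posterior samples of Q*(s, .).

    q_samples: array of shape (num_samples, num_actions); row z holds the
        action values at a fixed state s under posterior sample z
        (assumed equally weighted).
    beta: non-negative rate-distortion trade-off (assumption: larger beta
        penalizes distortion more heavily).
    Returns (conditional, marginal), where conditional[z, a] is the
    probability of target action a given sample z.
    """
    q_samples = np.asarray(q_samples, dtype=float)
    num_samples, num_actions = q_samples.shape

    # Squared-regret distortion: d(Q, a) = (max_a' Q(s, a') - Q(s, a))^2.
    distortion = (q_samples.max(axis=1, keepdims=True) - q_samples) ** 2

    # Start from a uniform marginal over target actions.
    marginal = np.full(num_actions, 1.0 / num_actions)
    conditional = np.tile(marginal, (num_samples, 1))

    for _ in range(num_iters):
        # Conditional update: p(a | z) proportional to marginal(a) * exp(-beta * d(z, a)).
        log_marginal = np.log(np.maximum(marginal, 1e-300))  # guard against log(0)
        logits = log_marginal[None, :] - beta * distortion
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        conditional = np.exp(logits)
        conditional /= conditional.sum(axis=1, keepdims=True)

        # Marginal update: average the conditional over posterior samples.
        marginal = conditional.mean(axis=0)

    return conditional, marginal
```

In this sketch, `beta = 0` ignores distortion entirely and leaves the action distribution uniform, while a very large `beta` makes each posterior sample select its own greedy action, which corresponds to acting greedily with respect to a sampled value function. Intermediate values give the spectrum of satisficing behaviors the summary refers to.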

Abstract

A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world, however, attaining optimal performance may in fact be an entirely intractable endeavor and an agent may seldom find itself in a position to complete the requisite exploration for identifying an optimal policy. Recent work has leveraged tools from information theory to design agents that deliberately forgo optimal solutions in favor of sufficiently-satisfying or satisficing solutions, obtained through lossy compression. Notably, such agents may employ fundamentally different exploratory decisions to learn satisficing behaviors more efficiently than optimal ones that are more data intensive. While supported by a rigorous corroborating theory, the underlying algorithm relies on model-based planning, drastically limiting the compatibility of these ideas with function approximation and high-dimensional observations. In this work, we remedy this issue by extending an agent that directly represents uncertainty over the optimal value function allowing it to both bypass the need for model-based planning and to learn satisficing policies. We provide simple yet illustrative experiments that demonstrate how our algorithm enables deep reinforcement-learning agents to achieve satisficing behaviors. In keeping with previous work on this setting for multi-armed bandits, we additionally find that our algorithm is capable of synthesizing optimal behaviors, when feasible, more efficiently than its non-information-theoretic counterpart.

Paper Structure (12 sections, 4 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Problem Formulation
Satisficing with Randomized Value Functions
Randomized Value Functions
Blahut-Arimoto Randomized Value Functions
Experiments
Conclusion
Preliminaries
Algorithms
Related Work
Additional Minigrid Experiment Details
Compute Details

Figures (3)

Figure 1: (Top) MiniGrid environments used in our empirical evaluation of Blahut-Arimoto RVF. An observation is a partial image of the whole grid indicated by the shaded region. Black tiles represent empty squares, gray tiles represent walls, and colored tiles represent goal states. The agent begins in the upper left corner and an episode terminates when the agent either reaches a goal state or takes $100$ steps. (Bottom) Learning curves of DQN, RVF, and Blahut-Arimoto RVF.
Figure 2: The RiverSwim MDP of Strehl and Littman (2008), as studied by Osband et al. (2013).
Figure 3: Learning curves for BA-RVF with varying $\beta$ values in the ConfluenceSwim environment.
