EVAL: EigenVector-based Average-reward Learning

Jacob Adamczyk; Volodymyr Makarenko; Stas Tiomkin; Rahul V. Kulkarni

TL;DR

This paper extends an eigenvector-based method for entropy-regularized average-reward RL to neural-network function approximation. The formulation reveals new theoretical insights into the relationship between different objectives used in RL, and combining the algorithm with a posterior policy iteration scheme shows that the approach can also solve the average-reward RL problem without entropy regularization.

Abstract

In reinforcement learning, two objective functions have been developed extensively in the literature: discounted and averaged rewards. The generalization to an entropy-regularized setting has led to improved robustness and exploration for both of these objectives. Recently, the entropy-regularized average-reward problem was addressed using tools from large deviation theory in the tabular setting. This method has the advantage of linearity, providing access to both the optimal policy and average reward-rate through properties of a single matrix. In this paper, we extend that framework to more general settings by developing approaches based on function approximation by neural networks. This formulation reveals new theoretical insights into the relationship between different objectives used in RL. Additionally, we combine our algorithm with a posterior policy iteration scheme, showing how our approach can also solve the average-reward RL problem without entropy-regularization. Using classic control benchmarks, we experimentally find that our method compares favorably with other algorithms in terms of stability and rate of convergence.

Paper Structure (19 sections, 1 theorem, 18 equations, 7 figures, 4 tables, 3 algorithms)

Introduction
Preliminaries
Prior Work
Solution Method
Proposed Algorithms
Solution to ERAR-MDP
Posterior Policy Iteration
Experiments
Limitations and Future Work
Conclusion
Acknowledgements
Theory
Experimental Details
Implementation
Choice of Architecture
...and 4 more sections

Key Result

Lemma 1

The entropy-regularized average reward rate is given by $\theta$, and the optimal differential value function and optimal policy are given by:
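The displayed equations of the lemma are not reproduced in this summary. As a rough illustration of the "single matrix" idea described in the abstract, the following tabular sketch assumes the standard tilted-matrix construction: the matrix with entries $e^{\beta r(s,a)}\, p(s'|s,a)\, \pi_0(a'|s')$ has a dominant eigenvalue $\rho$ and a positive eigenvector $u(s,a)$, from which one reads off $\theta = \frac{1}{\beta}\log\rho$, a differential value $\frac{1}{\beta}\log u(s,a)$ (defined up to an additive constant), and a policy proportional to $\pi_0(a|s)\,u(s,a)$. The function name `solve_tabular`, the array layout, and the random test problem are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def solve_tabular(p, r, prior, beta):
    """Hypothetical tabular solver (illustrative only).

    p[s, a, s2]  : transition probabilities p(s2 | s, a)
    r[s, a]      : rewards
    prior[s, a]  : prior policy pi_0(a | s)
    beta         : inverse temperature
    """
    n_s, n_a = r.shape
    # Tilted matrix indexed by (s, a) -> (s2, a2):
    # exp(beta * r(s, a)) * p(s2 | s, a) * pi_0(a2 | s2)
    tilted = (np.exp(beta * r)[:, :, None, None]
              * p[:, :, :, None]
              * prior[None, None, :, :]).reshape(n_s * n_a, n_s * n_a)

    # Dominant (Perron) eigenpair; assumes the matrix is irreducible,
    # so the eigenvalue is real and the eigenvector strictly positive.
    eigvals, eigvecs = np.linalg.eig(tilted)
    k = np.argmax(eigvals.real)
    rho = eigvals[k].real
    u = np.abs(eigvecs[:, k].real).reshape(n_s, n_a)

    theta = np.log(rho) / beta            # average reward rate
    q = np.log(u) / beta                  # differential value, up to a constant
    policy = prior * u
    policy /= policy.sum(axis=1, keepdims=True)
    return theta, q, policy

# Small random problem (made-up data, for illustration only).
rng = np.random.default_rng(0)
n_s, n_a, beta = 4, 2, 5.0
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
prior = np.full((n_s, n_a), 1.0 / n_a)

theta, q, policy = solve_tabular(p, r, prior, beta)
print("theta:", theta)
print("policy:\n", policy)
```

Taking the logarithm of the eigenvalue equation gives a soft Bellman-type relation, $q(s,a) = r(s,a) - \theta + \frac{1}{\beta}\log \sum_{s',a'} p(s'|s,a)\,\pi_0(a'|s')\,e^{\beta q(s',a')}$, which is one way to check the output of the sketch numerically.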

Figures (7)

Figure 1: Performance of discounted soft Q-learning (SQL) as a function of the discount factor, compared with the solution obtained by the proposed average-reward method (EVAL). Note that the average-reward solution (blue line) recovers the discounted solution as $\gamma \to 1$. For the discounted objective, computational cost grows as $(1-\gamma)^{-1}$, and choosing a low discount factor to reduce computational cost can result in lower rewards. The boundary between the low-reward, low-complexity regime and the high-reward, high-complexity regime is demarcated by the discount factor derived from the spectral gap of the associated tilted matrix (cf. "Preliminaries"). Insets: state-occupation distributions following the SQL optimal policies at $\gamma=0.87, 0.93$. The green dot denotes the initial position of the agent, and the star denotes the goal. The agent can move in any of the cardinal directions. Since we use entropy regularization, the optimal policy is stochastic, thus yielding a variance in the return (plotted with a shaded interval for each method). Inverse temperature $\beta=15$.
Figure 2: Classic control benchmark comparing soft Q-learning (SQL), deep Q network (DQN) and our two proposed methods (EVAL, EVAL+PPI). We find EVAL and EVAL+PPI to generally obtain higher reward with less variance than SQL or DQN.
Figure 3: As a demonstration of the usefulness of EVAL+PPI, we consider a modified version of CartPole which represents a continuing task. After training for 5000 steps (in the standard CartPole-v1 environment with a maximum episode length of 500), we compare the evaluation performance of SQL with EVAL+PPI. Specifically, we set the time-limit of the environment much higher: to 100,000 steps. We find that EVAL+PPI consistently reaches the maximum number of steps while SQL only rarely achieves similarly high reward. We find that EVAL+PPI can continue episodes for at least 10 billion steps (as of submission).
Figure 4: The final two columns show the additionally tuned (with all others held fixed) hyperparameters specific to PPI.
Figure 5: $\varepsilon_{\textrm{frac}}$ denotes the exploration fraction over which $\varepsilon$ is decayed from $\varepsilon=1.0$ to $\varepsilon=\varepsilon_{\textrm{final}}$. A training frequency of $-1$ indicates that training of the $Q$ networks occurs only after the end of each rollout episode. The optimal values found were $\tau=1.0$ and a hidden dimension of 256 in all environments.
...and 2 more figures

Theorems & Definitions (1)

Lemma 1: PRR
