Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

Bhrij Patel, Wesley A. Suttle, Alec Koppel, Vaneet Aggarwal, Brian M. Sadler, Amrit Singh Bedi, Dinesh Manocha

TL;DR

This work tackles the challenge of achieving global convergence in average-reward reinforcement learning without oracle knowledge of the mixing time, a bottleneck for practical policy-gradient methods. It introduces the Multi-level Actor-Critic (MAC) framework with a Multi-level Monte-Carlo (MLMC) gradient estimator, which removes the need for mixing-time or hitting-time information while achieving the best-known mixing-time dependence, $\tilde{O}(\sqrt{\tau_{mix}})$. Building on prior theory, the authors provide a global convergence guarantee for MAC with non-constant AdaGrad stepsizes and demonstrate improved practical performance over PPGAE in a 2D grid-world navigation task. The results suggest MAC offers a scalable, sample-efficient approach to average-reward RL with global optimality guarantees, enabling applications in robotics, traffic, and healthcare where mixing-time estimation is impractical.

Abstract

In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest dependence on mixing time known from prior work, $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$. With a 2D grid world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average reward settings.
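To make the MLMC idea concrete, here is a minimal Python sketch of a generic multi-level Monte-Carlo gradient estimate paired with an AdaGrad-style stepsize. It assumes the common MLMC construction (a geometrically distributed level, plus a scaled difference of sample averages, truncated at a maximum rollout length `t_max`); the exact form used in the paper may differ. The names `mlmc_gradient`, `adagrad_step`, and the `sample_gradient` callable are hypothetical and introduced only for illustration.

```python
import numpy as np


def mlmc_gradient(sample_gradient, t_max, rng):
    """Generic MLMC gradient estimate (illustrative sketch, not the paper's code).

    sample_gradient(n) is a hypothetical callable that returns an (n, d) array
    of n consecutive per-sample gradient estimates from one trajectory.
    Returns the estimate and the number of samples that were drawn.
    """
    # Draw the level J with P(J = j) = 2^{-j}, j = 1, 2, ...
    level = int(rng.geometric(0.5))
    use_correction = 2 ** level <= t_max
    # Only roll out 2^J samples when the correction term is actually used.
    n = 2 ** level if use_correction else 1
    grads = sample_gradient(n)
    estimate = grads[:1].mean(axis=0)  # base term: single-sample average
    if use_correction:
        g_hi = grads[: 2 ** level].mean(axis=0)        # average over 2^J samples
        g_lo = grads[: 2 ** (level - 1)].mean(axis=0)  # average over 2^(J-1) samples
        estimate = estimate + 2 ** level * (g_hi - g_lo)
    return estimate, n


def adagrad_step(theta, grad, sq_norm_sum, alpha=1.0, eps=1e-12):
    """One ascent step with a stepsize of alpha / sqrt(sum of squared gradient norms)."""
    sq_norm_sum = sq_norm_sum + float(grad @ grad)
    theta = theta + alpha / np.sqrt(sq_norm_sum + eps) * grad
    return theta, sq_norm_sum
```

Note that neither function takes the mixing time as an input: the rollout length is random and capped by `t_max`, and the stepsize adapts to the observed gradient norms.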

Paper Structure

This paper contains 19 sections, 14 theorems, 74 equations, 2 figures, 1 table, 3 algorithms.

Key Result

Lemma 2

The gradient of the long-term average reward can be expressed as follows.
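
For orientation, the policy gradient theorem for the average-reward setting is commonly written in the form below. The notation is an assumption and may differ from the paper's own statement of Lemma 2: $d^{\pi_\theta}$ denotes the stationary state distribution under policy $\pi_\theta$, and $Q^{\pi_\theta}$ the differential action-value function.

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]$$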

Figures (2)

  • Figure 1: Minimum $H$ required for $K = 1$ given a mixing time $\tau_{mix}$. Both $H$ and $\tau_{mix}$ are measured in number of samples. We set the hitting time to be 10 for this plot.
  • Figure 2: Success rate in a sparse $15$-by-$15$ grid over 300 training episodes with $200$ samples per episode, with $100$ trials for each algorithm. For MAC, $T_{max} = 4$; for PPGAE, $H = 200$ and $N = 1$. Vanilla AC and REINFORCE both have $H = 200$. PPGAE, Vanilla AC, and REINFORCE consistently converge to significantly less optimal solutions than MAC.

Theorems & Definitions (16)

  • Definition 1: $\epsilon$-Mixing Time
  • Lemma 2
  • Definition 4
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • Lemma 11
  • Lemma 12
  • Theorem 13
  • Lemma 14
  • ...and 6 more
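
As an illustration of the $\epsilon$-mixing time in Definition 1, the sketch below computes it for a small Markov chain under the standard textbook definition: the smallest $t$ at which the worst-case total variation distance between the $t$-step distribution and the stationary distribution is at most $\epsilon$. This definition, the default $\epsilon = 1/4$, and the function names are assumptions and may not match the paper's exact statement. The computation needs the full transition matrix, which is why it does not scale to large state spaces.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P (assumed ergodic)."""
    # Left eigenvector of P for the eigenvalue closest to 1, normalised to sum to 1.
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


def mixing_time(P, eps=0.25, t_cap=10_000):
    """Smallest t with max_s TV(P^t(. | s), d) <= eps, or None if t_cap is hit."""
    d = stationary_distribution(P)
    Pt = np.eye(P.shape[0])
    for t in range(1, t_cap + 1):
        Pt = Pt @ P
        # Total variation distance is half the L1 distance; take the worst start state.
        tv = 0.5 * np.abs(Pt - d).sum(axis=1).max()
        if tv <= eps:
            return t
    return None


# Hypothetical two-state chain that switches state with probability 0.1 per step.
P_example = np.array([[0.9, 0.1], [0.1, 0.9]])
tau = mixing_time(P_example)
```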