Auto-exploration for online reinforcement learning

Caleb Ju, Guanghui Lan

TL;DR

The paper tackles the exploration-exploitation trade-off in online reinforcement learning by proposing auto-exploration methods that are parameter-free and achieve an algorithm-independent $O(D_{expl}(\delta)\epsilon^{-2})$ sample complexity under a structural assumption that an exploring policy exists. Two variants are developed: a tabular method leveraging implicit exploration and a linear-function-approximation method requiring explicit exploration. These rely on innovations such as a dynamic mixing time, sampling from the discounted state visitation distribution, a robust gradient estimator, and an advantage-gap-based certificate of convergence. The work provides high-probability, last-iterate convergence guarantees for stochastic policy mirror descent, together with detailed, constructive procedures (including CTD and Monte Carlo estimators) that enable practical, parameter-free auto-exploration across settings. These results offer theoretical insight into the necessity of explicit exploration under function approximation and pave the way for scalable, adaptive RL algorithms with provable guarantees in online, finite state-action environments.

Abstract

The exploration-exploitation dilemma in reinforcement learning (RL) is a fundamental challenge to efficient RL algorithms. Existing algorithms for finite state and action discounted RL problems address this by assuming sufficient exploration over both state and action spaces. However, this yields non-implementable algorithms and sub-optimal performance. To resolve these limitations, we introduce a new class of methods with auto-exploration, or methods that automatically explore both state and action spaces in a parameter-free way, i.e., without a priori knowledge of problem-dependent parameters. We present two variants: one for the tabular setting and one for linear function approximation. Under algorithm-independent assumptions on the existence of an exploring optimal policy, both methods attain $O(\epsilon^{-2})$ sample complexity to solve to $\epsilon$ error. Crucially, these complexities are novel since they are free of the algorithm-dependent parameters seen in prior works, which may be arbitrarily large. The methods are also simple to implement because they are parameter-free and do not directly estimate the unknown parameters. These feats are achieved by new algorithmic innovations for RL, including a dynamic mixing time, a discounted state distribution for sampling, a simple robust gradient estimator, and a recent advantage gap function to certify convergence.
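
One ingredient named in the abstract is sampling from a discounted state distribution. A common way to draw such a sample is to run the policy for a geometrically distributed number of steps; the sketch below illustrates that idea only and is not claimed to be the paper's procedure. The names `sample_discounted_state` and `exact_discounted_visitation`, the array layout `P[s, a, s']`, and the use of an explicit transition tensor (an online method would only query the environment) are assumptions made for illustration.

```python
import numpy as np


def sample_discounted_state(P, policy, s0, gamma, rng):
    """Draw one state from d(s) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s | s_0 = s0).

    P      : array of shape (S, A, S), P[s, a, s'] = transition probability (assumed layout)
    policy : array of shape (S, A), policy[s, a] = probability of action a in state s
    """
    # Generator.geometric returns k >= 1 with P(k) = gamma^(k-1) * (1 - gamma),
    # so T = k - 1 satisfies P(T = t) = (1 - gamma) * gamma^t for t = 0, 1, 2, ...
    T = rng.geometric(1.0 - gamma) - 1
    s = s0
    for _ in range(T):
        a = rng.choice(policy.shape[1], p=policy[s])  # act according to the policy
        s = rng.choice(P.shape[2], p=P[s, a])         # step the (hypothetical) MDP
    return int(s)


def exact_discounted_visitation(P, policy, s0, gamma):
    """Closed form d^T = (1 - gamma) * e_{s0}^T (I - gamma * P_pi)^{-1}, for comparison."""
    n_states = P.shape[0]
    # State-to-state kernel under the policy: P_pi[s, s'] = sum_a policy[s, a] * P[s, a, s']
    P_pi = np.einsum("sa,sat->st", policy, P)
    e = np.zeros(n_states)
    e[s0] = 1.0
    return np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, (1.0 - gamma) * e)
```

On a small random MDP, the empirical frequencies of `sample_discounted_state` should approach the vector returned by `exact_discounted_visitation` as the number of draws grows; the expected trajectory length per draw is $\gamma/(1-\gamma)$.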

Paper Structure

This paper contains 27 sections, 25 theorems, 63 equations, 1 figure, 3 algorithms.

Key Result

Lemma 2.1

Let $\pi$ and $\pi'$ be two feasible policies. Then we have where for a given $p \in \Delta_{\vert \mathcal{A} \vert}$, the advantage function is defined as
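
For orientation, a standard performance-difference identity of this type can be written as below. This is a sketch in assumed notation: the symbols $V^{\pi}$, $Q^{\pi}$, $\psi^{\pi}$, $d_s^{\pi'}$, and the discount factor $\gamma$ are not taken from the text above, and the paper's own statement may differ (for example, in normalization or through an added regularization term).

```latex
V^{\pi'}(s) - V^{\pi}(s)
  = \frac{1}{1-\gamma} \sum_{s' \in \mathcal{S}}
      d_s^{\pi'}(s')\, \psi^{\pi}\bigl(s', \pi'(\cdot \mid s')\bigr),
\qquad
\psi^{\pi}(s, p) = \bigl\langle Q^{\pi}(s, \cdot),\, p \bigr\rangle - V^{\pi}(s),
\qquad
d_s^{\pi'}(s') = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
      \Pr\nolimits^{\pi'}\bigl(s_t = s' \mid s_0 = s\bigr).
```

Here $p \in \Delta_{\vert \mathcal{A} \vert}$ is a distribution over actions, so $\psi^{\pi}(s, p)$ is the advantage of playing $p$ in state $s$ and following $\pi$ afterwards.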

Figures (1)

  • Figure 1: Visualization of which policy is used to sample at each time period. Each dot represents the state of the MDP at a certain time, and the dashed line indicates an extended, random time period.

Theorems & Definitions (45)

  • Lemma 2.1
  • Proposition 2.2
  • Lemma 2.3
  • proof
  • Lemma 2.4
  • Proposition 2.5
  • proof: Proof of \ref{eq:last_iter_opt_gap} from Proposition \ref{prop:stronger_gap_converge_with_relaxed}
  • Lemma 3.1
  • proof
  • Proposition 3.3
  • ...and 35 more