Gap-Dependent Bounds for Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Haochen Zhang, Zhong Zheng, Lingzhou Xue

TL;DR

This work provides the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++. It also introduces a concurrent variant that enables efficient parallel exploration across multiple agents, and establishes the first gap-dependent sample complexity upper bound for online multi-agent RL with linear function approximation.

Abstract

We study gap-dependent performance guarantees for nearly minimax-optimal algorithms in reinforcement learning with linear function approximation. While prior works have established gap-dependent regret bounds in this setting, existing analyses do not apply to algorithms that achieve the nearly minimax-optimal worst-case regret bound $\tilde{O}(d\sqrt{H^3K})$, where $d$ is the feature dimension, $H$ is the horizon length, and $K$ is the number of episodes. We bridge this gap by providing the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++ (He et al., 2023). Our analysis yields improved dependencies on both $d$ and $H$ compared to previous gap-dependent results. Moreover, leveraging the low policy-switching property of LSVI-UCB++, we introduce a concurrent variant that enables efficient parallel exploration across multiple agents and establish the first gap-dependent sample complexity upper bound for online multi-agent RL with linear function approximation, achieving linear speedup with respect to the number of agents.

Paper Structure

This paper contains 16 sections, 21 theorems, 91 equations, 1 table, and 2 algorithms.

Key Result

Proposition 3.4

For any policy $\pi$, there exist weights $\{\mathbf{w}_h^\pi\}_{h=1}^H$ such that for any state-action-step triple $(s,a,h)\in {\mathcal{S}} \times \mathcal{A} \times[H]$, we have $\mathbb{P}_{s,a,h} V_{h+1}^\pi = \langle \bm{\phi}(s,a), \mathbf{w}_h^\pi\rangle$.
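The proposition can be sanity-checked numerically on a small finite example. The sketch below assumes the standard linear MDP structure, $\mathbb{P}_h(s' \mid s,a) = \langle \bm{\phi}(s,a), \bm{\mu}_h(s')\rangle$ with linear rewards (Definition 3.1 is not reproduced here, so this is an assumption), in which case one valid choice of weights is $\mathbf{w}_h^\pi = \sum_{s'} \bm{\mu}_h(s') V_{h+1}^\pi(s')$. The function names, the random construction of `phi`, `mu`, and `theta`, and the problem sizes are all hypothetical and chosen only for illustration; steps are 0-indexed in the code, whereas the text uses $h \in [H]$.

```python
import numpy as np


def build_linear_mdp(S, A, d, H, seed=0):
    """Random finite linear MDP (hypothetical construction for illustration)."""
    rng = np.random.default_rng(seed)
    # phi(s, a) lies on the d-simplex, so phi @ mu[h] is a valid distribution.
    phi = rng.dirichlet(np.ones(d), size=(S, A))   # shape (S, A, d)
    # mu[h, i, :] is a probability distribution over next states.
    mu = rng.dirichlet(np.ones(S), size=(H, d))    # shape (H, d, S)
    # Assumed linear rewards r_h(s, a) = <phi(s, a), theta_h>, which lie in [0, 1].
    theta = rng.uniform(0.0, 1.0, size=(H, d))     # shape (H, d)
    return phi, mu, theta


def policy_weights(phi, mu, theta, policy):
    """Backward induction for V^pi and the weights w_h = mu_h V_{h+1}^pi."""
    H, d, S = mu.shape
    V = np.zeros((H + 1, S))          # V[H] = 0 is the terminal value
    w = np.zeros((H, d))
    for h in range(H - 1, -1, -1):
        w[h] = mu[h] @ V[h + 1]       # sum over s' of mu_h(s') V_{h+1}(s')
        Q = phi @ (theta[h] + w[h])   # Q_h^pi(s, a), shape (S, A)
        V[h] = Q[np.arange(S), policy[h]]
    return V, w


def max_linearity_error(S=5, A=3, d=4, H=6, seed=0):
    """Largest gap between [P_h V_{h+1}](s, a) and <phi(s, a), w_h>."""
    phi, mu, theta = build_linear_mdp(S, A, d, H, seed)
    rng = np.random.default_rng(seed + 1)
    policy = rng.integers(0, A, size=(H, S))   # arbitrary deterministic policy
    V, w = policy_weights(phi, mu, theta, policy)
    err = 0.0
    for h in range(H):
        P_h = phi @ mu[h]                      # P_h(s' | s, a), shape (S, A, S)
        lhs = P_h @ V[h + 1]                   # expectation of V_{h+1} under P_h
        rhs = phi @ w[h]                       # linear form from the proposition
        err = max(err, float(np.abs(lhs - rhs).max()))
    return err
```

Calling `max_linearity_error()` should return a value at the level of floating-point rounding, since under the assumed structure the identity reduces to associativity of the matrix products.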

Theorems & Definitions (24)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Proposition 3.4: Proposition 3.3 of He et al. (2021)
  • Theorem 4.1
  • Corollary 4.2
  • Theorem 4.3
  • Lemma 5.1
  • Lemma 5.2: Informal
  • Lemma 5.3
  • ...and 14 more