Gap-Dependent Bounds for Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Haochen Zhang; Zhong Zheng; Lingzhou Xue

TL;DR

This work provides the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++. It also introduces a concurrent variant that enables efficient parallel exploration across multiple agents, and establishes the first gap-dependent sample complexity upper bound for online multi-agent RL with linear function approximation.

Abstract

We study gap-dependent performance guarantees for nearly minimax-optimal algorithms in reinforcement learning with linear function approximation. While prior works have established gap-dependent regret bounds in this setting, existing analyses do not apply to algorithms that achieve the nearly minimax-optimal worst-case regret bound $\tilde{O}(d\sqrt{H^3K})$, where $d$ is the feature dimension, $H$ is the horizon length, and $K$ is the number of episodes. We bridge this gap by providing the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++ (He et al., 2023). Our analysis yields improved dependencies on both $d$ and $H$ compared to previous gap-dependent results. Moreover, leveraging the low policy-switching property of LSVI-UCB++, we introduce a concurrent variant that enables efficient parallel exploration across multiple agents and establish the first gap-dependent sample complexity upper bound for online multi-agent RL with linear function approximation, achieving linear speedup with respect to the number of agents.

Paper Structure (16 sections, 21 theorems, 91 equations, 1 table, 2 algorithms)

Introduction
Related Work
Preliminaries
Theoretical Guarantee
Algorithm Review
Gap-Dependent Regret Upper Bound
Extension to Concurrent RL
Proof Sketch of Theorem 4.1
Proof Sketch of Lemma 5.2
Conclusion
Auxiliary Lemmas
Probability Events
Properties of Value Function Estimates
Proof of Theorem 4.1
Proof of Corollary 4.2
...and 1 more sections

Key Result

Proposition 3.4

For any policy $\pi$, there exist weights $\{\mathbf{w}_h^\pi\}_{h=1}^H$ such that for any state-action-step triple $(s,a,h)\in {\mathcal{S}} \times \mathcal{A} \times[H]$, we have $\mathbb{P}_{s,a,h} V_{h+1}^\pi = \langle \bm{\phi}(s,a), \mathbf{w}_h^\pi\rangle$.
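The linearity claimed in Proposition 3.4 can be checked numerically on a toy instance. The sketch below is not taken from the paper. It assumes the standard linear MDP structure, in which the transition kernel factors as $\mathbb{P}(s' \mid s,a) = \langle \bm{\phi}(s,a), \bm{\mu}(s')\rangle$, and it considers a single step $h$ with an arbitrary next-step value vector. All names (`make_linear_mdp`, `weight_for_value`, `max_linearity_gap`) and the small random instance are hypothetical and chosen only for illustration. Under that assumption, the weight vector $\mathbf{w} = \sum_{s'} \bm{\mu}(s') V(s')$ should reproduce the expected next-step value for every state-action pair.

```python
import random


def make_linear_mdp(num_states, num_actions, d, seed=0):
    """Build a toy linear MDP (an assumed structure, not the paper's setup).

    phi[(s, a)] is a probability vector over d latent factors, and mu[i] is a
    distribution over next states, so P(s'|s,a) = sum_i phi_i(s,a) * mu_i(s').
    """
    rng = random.Random(seed)

    def simplex(n):
        xs = [rng.random() + 1e-3 for _ in range(n)]
        total = sum(xs)
        return [x / total for x in xs]

    phi = {(s, a): simplex(d)
           for s in range(num_states) for a in range(num_actions)}
    mu = [simplex(num_states) for _ in range(d)]
    return phi, mu


def transition_row(phi_sa, mu):
    """Next-state distribution P(.|s,a) implied by the factorization."""
    num_states = len(mu[0])
    return [sum(phi_sa[i] * mu[i][s2] for i in range(len(mu)))
            for s2 in range(num_states)]


def weight_for_value(mu, v_next):
    """Weight vector w with w_i = sum_{s'} mu_i(s') * V(s')."""
    return [sum(mu_i[s2] * v_next[s2] for s2 in range(len(v_next)))
            for mu_i in mu]


def max_linearity_gap(phi, mu, v_next):
    """Largest |P V - <phi, w>| over all state-action pairs."""
    w = weight_for_value(mu, v_next)
    gap = 0.0
    for phi_sa in phi.values():
        p = transition_row(phi_sa, mu)
        expected = sum(p_s2 * v for p_s2, v in zip(p, v_next))
        linear = sum(f * w_i for f, w_i in zip(phi_sa, w))
        gap = max(gap, abs(expected - linear))
    return gap


phi, mu = make_linear_mdp(num_states=5, num_actions=3, d=4)
v_next = [0.0, 1.0, 2.5, 0.3, 1.7]  # arbitrary next-step values
gap = max_linearity_gap(phi, mu, v_next)
```

In this toy instance `gap` is zero up to floating-point error, because the identity holds for any value vector once the transition kernel is linear in the features. This is why the weights in the proposition can depend on the policy only through $V_{h+1}^\pi$.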

Theorems & Definitions (24)

Definition 3.1
Definition 3.2
Definition 3.3
Proposition 3.4: Proposition 3.3 of He et al. (2021)
Theorem 4.1
Corollary 4.2
Theorem 4.3
Lemma 5.1
Lemma 5.2: Informal
Lemma 5.3
...and 14 more
