A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Heyang Zhao; Jiafan He; Quanquan Gu

TL;DR

The paper studies reinforcement learning with general function approximation, targeting both sample efficiency and deployment cost. It introduces MQL-UCB, an algorithm that combines rare policy switching, variance-weighted regression, and a monotonic value-function structure to achieve near-minimax regret while updating the policy only rarely. The main theoretical contributions are a regret guarantee that scales with the generalized eluder dimension and a switching-cost bound that nearly matches known lower bounds; specializing to linear MDPs yields optimal rates. The results show that Markov policies with general function classes can be both statistically efficient and deployment-efficient in nonlinear settings.
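
To make the idea of sparse policy updates concrete, here is a minimal sketch of a lazy-update rule. It is not the paper's switching criterion for general function classes: it swaps in the determinant-doubling rule familiar from the linear setting, where the policy is refreshed only when the determinant of the regularized feature covariance has grown by a fixed factor since the last refresh. The function name `count_policy_switches`, the parameters `lam` and `ratio`, and the random unit features are all illustrative assumptions.

```python
import math
import random


def count_policy_switches(features, lam=1.0, ratio=2.0):
    """Count lazy policy updates under a determinant-doubling rule.

    Assumption: linear features stand in for a general function class.
    features: list of d-dimensional vectors (lists of floats), one per episode.
    lam:      ridge regularizer; the covariance starts at lam * I.
    ratio:    switch once det(covariance) grew by more than this factor.
    """
    d = len(features[0])
    # Inverse covariance, maintained with Sherman-Morrison rank-one updates.
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    logdet = d * math.log(lam)
    logdet_at_switch = logdet
    switches = 0
    for x in features:
        u = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        quad = sum(x[i] * u[i] for i in range(d))  # x^T Sigma^{-1} x
        # Matrix determinant lemma: det(S + x x^T) = det(S) * (1 + x^T S^{-1} x).
        logdet += math.log1p(quad)
        for i in range(d):
            for j in range(d):
                inv[i][j] -= u[i] * u[j] / (1.0 + quad)
        if logdet - logdet_at_switch > math.log(ratio):
            switches += 1
            logdet_at_switch = logdet
    return switches


def random_unit_features(num_episodes, d, seed=0):
    """Hypothetical data: i.i.d. random directions on the unit sphere."""
    rng = random.Random(seed)
    out = []
    for _ in range(num_episodes):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(a * a for a in v))
        out.append([a / norm for a in v])
    return out


if __name__ == "__main__":
    for K in (100, 1000, 10000):
        print(K, count_policy_switches(random_unit_features(K, d=4)))
```

With unit-norm features, the log-determinant can grow by at most $d \log(1 + K/(d\lambda))$ over $K$ episodes, so the number of switches under this rule grows only logarithmically in $K$.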

Abstract

The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for RL with general function approximation. Our key algorithmic designs include (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, where $d$ is the eluder dimension of the function class, $H$ is the planning horizon, and $K$ is the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
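
As a rough illustration of design (3), the sketch below fits a function from a small finite class by minimizing a squared loss in which each sample is divided by its (estimated) variance, so that noisy samples count for less. The helper `weighted_least_squares`, the floor `var_floor`, and the toy class of linear maps are hypothetical and chosen only for illustration; the paper's actual estimator and variance estimates are not reproduced here.

```python
def weighted_least_squares(function_class, inputs, targets, variances, var_floor=1e-2):
    """Return the member of a finite function class with the smallest
    variance-weighted squared loss.

    function_class: list of callables f(z) -> float (illustrative finite class).
    variances:      per-sample variance estimates (assumed given).
    var_floor:      hypothetical lower bound that keeps the weights bounded.
    """
    def loss(f):
        return sum(
            (f(z) - y) ** 2 / max(v, var_floor)
            for z, y, v in zip(inputs, targets, variances)
        )

    return min(function_class, key=loss)


if __name__ == "__main__":
    # Toy class: f_c(z) = c * z for slopes c on a grid.
    slopes = [0.5 * i for i in range(9)]
    toy_class = [lambda z, c=c: c * z for c in slopes]

    inputs = [1.0, 1.0]
    targets = [2.0, 4.0]
    # The second sample is much noisier, so it is down-weighted.
    weighted = weighted_least_squares(toy_class, inputs, targets, [1.0, 100.0])
    uniform = weighted_least_squares(toy_class, inputs, targets, [1.0, 1.0])
    print(weighted(1.0), uniform(1.0))
```

In the toy data, the uniformly weighted fit splits the difference between the two targets, while the variance-weighted fit stays close to the low-noise sample.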

Paper Structure (49 sections, 34 theorems, 167 equations, 1 table)

Introduction
Notation.
Preliminaries
Time-Inhomogeneous Episodic MDP
Function Classes and Covering Numbers
Algorithm and Key Techniques
Rare Policy Switching
Weighted Regression
Variance Estimator
Monotonic Value Function
Main Results
Conclusion and Future Work
Additional Related Work
RL with Linear Function Approximation
RL with General Function Approximation
...and 34 more sections

Key Result

Theorem 4.1

Suppose the completeness assumption holds for function classes $\mathcal{F} := \{\mathcal{F}_h\}_{h = 1}^H$ and Definition 2.4 (generalized eluder dimension) holds with $\lambda = 1$. If we set $\alpha = 1 / \sqrt{KH}$, $\epsilon = (KLH)^{-1}$, and set $\widehat{\beta}_k^2 = \widecheck{\beta}_k^2 := O\left(\log\frac{2k^2 (2 \cdots}{\cdots}\right)$ ... where $\mathrm{Var}_K := \sum_{k = 1}^K \sum_{h = 1}^H [\mathbb{V}_h V_{h + 1}^{\pi^k}] (s_h^k, a_h^k)$ ...
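
For a sense of scale, the following sketch evaluates the two parameter choices stated in the theorem, $\alpha = 1/\sqrt{KH}$ and $\epsilon = (KLH)^{-1}$, together with the leading-order rates $d\sqrt{HK}$ and $dH$ quoted in the abstract. Constants and logarithmic factors are dropped, the function names are hypothetical, and $L$ is used only as it appears in the theorem since this excerpt does not define it.

```python
import math


def theorem_parameters(K, H, L):
    """Parameter choices as stated in Theorem 4.1.

    K: number of episodes, H: planning horizon,
    L: the quantity L from the theorem (not defined in this excerpt).
    """
    alpha = 1.0 / math.sqrt(K * H)
    epsilon = 1.0 / (K * L * H)
    return alpha, epsilon


def leading_order_rates(d, H, K):
    """Leading-order terms from the abstract, ignoring constants and logs."""
    regret = d * math.sqrt(H * K)   # from the O~(d * sqrt(HK)) regret rate
    switching_cost = d * H          # from the O~(dH) switching-cost rate
    return regret, switching_cost


if __name__ == "__main__":
    alpha, epsilon = theorem_parameters(K=10_000, H=25, L=1.0)
    regret, switching_cost = leading_order_rates(d=10, H=25, K=10_000)
    print(alpha, epsilon, regret, switching_cost)
```

The contrast between the two rates is the point of the low-switching design: the regret term grows with $\sqrt{K}$, while the switching-cost term does not depend on $K$ beyond logarithmic factors.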

Theorems & Definitions (45)

Definition 2.1
Remark 2.3
Definition 2.4: Generalized eluder dimension (agarwal2022vo)
Remark 2.5
Definition 2.6: Bonus oracle $\bar{D}_\mathcal{F}^2$
Remark 2.7
Definition 2.8: Covering numbers of function classes
Remark 2.9
Theorem 4.1
Corollary 4.2
...and 35 more
