A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

Heyang Zhao, Jiafan He, Quanquan Gu

TL;DR

The paper addresses reinforcement learning with general function approximation, targeting both sample efficiency and deployment cost. It introduces MQL-UCB, an algorithm that combines rare policy switching, variance-weighted regression, and a monotonic value-function structure to achieve near-minimax regret while keeping policy updates sparse. The main theoretical contributions are a regret guarantee that scales with the generalized eluder dimension and a switching-cost bound that nearly matches known lower bounds; the linear-MDP specialization yields optimal rates. The work shows that Markov policies with general function classes can be both statistically efficient and deployment-friendly, a step toward practical RL in nonlinear settings.

Abstract

The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
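
To make two of the ingredients named above more concrete, the following is a minimal sketch restricted to the linear special case. It is not the paper's algorithm: the switching rule for general function classes is not given in this summary, so the sketch substitutes the determinant-doubling rule familiar from the linear bandit literature, and it assumes per-sample variance estimates are supplied by the caller. The names `weighted_ridge` and `RareSwitcher` are hypothetical and chosen only for illustration.

```python
import numpy as np


def weighted_ridge(features, targets, variances, lam=1.0):
    """Variance-weighted ridge regression (assumed linear model).

    Each sample is weighted by 1 / variance, so low-noise samples
    contribute more to the estimate than high-noise ones.
    """
    weights = 1.0 / np.asarray(variances, dtype=float)
    dim = features.shape[1]
    weighted = features * weights[:, None]
    # Weighted Gram matrix: lam * I + sum_i w_i x_i x_i^T
    gram = lam * np.eye(dim) + weighted.T @ features
    rhs = weighted.T @ targets
    return np.linalg.solve(gram, rhs), gram


class RareSwitcher:
    """Determinant-doubling switch rule (stand-in, linear case only).

    The policy is recomputed only when the determinant of the weighted
    Gram matrix has grown by more than `threshold` since the last switch.
    """

    def __init__(self, dim, lam=1.0, threshold=2.0):
        self.gram = lam * np.eye(dim)
        self.logdet_at_switch = np.linalg.slogdet(self.gram)[1]
        self.log_threshold = np.log(threshold)
        self.num_switches = 0

    def observe(self, x, variance=1.0):
        """Add one weighted feature; return True if a switch is triggered."""
        self.gram += np.outer(x, x) / variance
        logdet = np.linalg.slogdet(self.gram)[1]
        if logdet - self.logdet_at_switch > self.log_threshold:
            self.logdet_at_switch = logdet
            self.num_switches += 1
            return True
        return False


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_samples = 4, 2000
    switcher = RareSwitcher(dim)
    for _ in range(num_samples):
        x = rng.normal(size=dim)
        x /= np.linalg.norm(x)  # unit-norm features
        switcher.observe(x)
    print("switches:", switcher.num_switches, "out of", num_samples, "samples")
```

With unit-norm features and unit variances, the log-determinant of the Gram matrix is at most $d \log(1 + K/d)$ after $K$ samples, so this stand-in rule triggers at most $d \log_2(1 + K/d)$ switches. That logarithmic dependence on $K$ is the behaviour a low-switching algorithm aims for, though the paper's own bound and rule are stated for general function classes.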

Paper Structure

This paper contains 49 sections, 34 theorems, 167 equations, and 1 table.

Key Result

Theorem 4.1

Suppose the completeness assumption holds for function classes $\mathcal{F} := \{\mathcal{F}_h\}_{h = 1}^H$ and Definition 2.4 (generalized eluder dimension) holds with $\lambda = 1$. If we set $\alpha = 1 / \sqrt{KH}$, $\epsilon = (KLH)^{-1}$, and set $\widehat{\beta}_k^2 = \widecheck{\beta}_k^2 := O\big(\log\frac{2k^2 (2 \cdots}{\cdots}\big)$ […] where $\mathrm{Var}_K := \sum_{k = 1}^K \sum_{h = 1}^H [\mathbb{V}_h V_{h + 1}^{\pi^k}] (s_h^k, a_h^k)$ […]

Theorems & Definitions (45)

  • Definition 2.1
  • Remark 2.3
  • Definition 2.4: Generalized eluder dimension [agarwal2022vo]
  • Remark 2.5
  • Definition 2.6: Bonus oracle $\bar{D}_\mathcal{F}^2$
  • Remark 2.7
  • Definition 2.8: Covering numbers of function classes
  • Remark 2.9
  • Theorem 4.1
  • Corollary 4.2
  • ...and 35 more