Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

Jianliang He; Han Zhong; Zhuoran Yang

TL;DR

This work advances sample-efficient reinforcement learning for infinite-horizon average-reward MDPs (AMDPs) under general function approximation by introducing a unifying complexity measure, AGEC, and a flexible algorithmic framework, LOOP. LOOP combines optimistic planning with a confidence-set construction and a lazy (low-switching) policy-update rule, and it admits both model-based and value-based instantiations. The authors prove a sublinear regret bound of $\tilde{\mathcal{O}}(\mathrm{sp}(V^*) \cdot d \sqrt{T\beta})$, where $d$ corresponds to AGEC, $\beta$ to the log-covering number of the hypothesis class, $\mathrm{sp}(V^*)$ is the span of the optimal state bias function, and $T$ is the number of steps. AGEC covers many tractable AMDP models, including linear, kernel, and linear-$Q^*/V^*$ variants. The framework thus gives a single theoretical basis for analyzing exploration in AMDPs with general function approximation.
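To make "sublinear" concrete, the short derivation below divides the stated bound by $T$. The symbol $\mathrm{Reg}(T)$ for cumulative regret is notation assumed here for illustration, and the logarithmic factors hidden by $\tilde{\mathcal{O}}$ are left implicit.

```latex
\[
\mathrm{Reg}(T) = \tilde{\mathcal{O}}\!\left(\mathrm{sp}(V^*) \cdot d \sqrt{T\beta}\right)
\quad\Longrightarrow\quad
\frac{\mathrm{Reg}(T)}{T} = \tilde{\mathcal{O}}\!\left(\mathrm{sp}(V^*) \cdot d \sqrt{\frac{\beta}{T}}\right)
\xrightarrow[T \to \infty]{} 0 .
\]
```

In words, for fixed $d$, $\beta$, and $\mathrm{sp}(V^*)$, the per-step regret decays at roughly a $1/\sqrt{T}$ rate, up to logarithmic factors.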

Abstract

We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP), which incorporates both model-based and value-based incarnations. In particular, LOOP features a novel construction of confidence sets and a low-switching policy updating scheme, which are tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure -- average-reward generalized eluder coefficient (AGEC) -- which captures the challenge of exploration in AMDPs with general function approximation. Such a complexity measure encompasses almost all previously known tractable AMDP models, such as linear AMDPs and linear mixture AMDPs, and also includes newly identified cases such as kernel AMDPs and AMDPs with Bellman eluder dimensions. Using AGEC, we prove that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*)) \sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to AGEC and log-covering number of the hypothesis class respectively, $\mathrm{sp}(V^*)$ is the span of the optimal state bias function, $T$ denotes the number of steps, and $\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors. When specialized to concrete AMDP models, our regret bounds are comparable to those established by the existing algorithms designed specifically for these special cases. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs.
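The two quantities in the regret bound can be illustrated with a minimal Python sketch. It assumes the standard definitions: the span of a bias function is its maximum minus its minimum over states, and regret over $T$ steps is $T$ times the optimal average reward minus the reward actually collected. The function names `span` and `average_reward_regret` and the toy numbers are hypothetical and do not come from the paper.

```python
from typing import Sequence


def span(values: Sequence[float]) -> float:
    """Span of a bias function given as one value per state: max minus min."""
    return max(values) - min(values)


def average_reward_regret(rewards: Sequence[float], optimal_gain: float) -> float:
    """Regret over T = len(rewards) steps: T * optimal_gain minus total reward."""
    return len(rewards) * optimal_gain - sum(rewards)


# Toy inputs (made up for illustration): a 3-state bias function and a
# 4-step reward sequence, with an assumed optimal average reward of 1.0.
bias = [0.0, 1.5, -0.5]
rewards = [1.0, 0.0, 1.0, 1.0]

print(span(bias))
print(average_reward_regret(rewards, optimal_gain=1.0))
```

A sublinear regret bound means that the second quantity grows more slowly than the number of steps, so the learner's average reward approaches the optimal one.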

Paper Structure (62 sections, 24 theorems, 136 equations, 1 table, 3 algorithms)

Introduction
Related Work
Infinite-horizon Average-reward MDPs.
Function Approximation in Finite-horizon MDPs.
Low-Switching Cost Algorithms.
Preliminaries
Notations.
Learning Objective
General Function Approximation
Average-Reward Generalized Eluder Coefficients
Relation with Tractable Complexity Metric
Eluder Dimension
Average-Reward Bellman Eluder (ABE) Dimension
Local-fitted Optimization with Optimism
Proof Overview of Regret Analysis
...and 47 more sections

Key Result

Lemma 3.1

Consider discrepancy function with $\mathcal{P}(f)=f^*$, and the expectation is taken over $s_{t+1}$ from $\mathbb{P}(\cdot|s_t,a_t)$. Let $d_{\rm E}={\rm dim_E}(\mathcal{X}_\mathcal{H},\epsilon)$ be the $\epsilon$-Eluder dimension defined over $\mathcal{X}_\mathcal{H}$, then we have $d_\text{G}\leq2d_{\rm E}\cdot\log T$ and $\k

Theorems & Definitions (42)

Definition 1: Value-based hypothesis
Definition 2: Model-based hypothesis
Definition 3: AGEC
Example 1: Bellman completeness $\subseteq$ Generalized completeness
Definition 4: Point-wise $\epsilon$-independence
Definition 5: Eluder dimension
Lemma 3.1: Low Eluder dim $\subseteq$ Low AGEC
Definition 6: Distributional $\epsilon$-independence
Definition 7: Distributional Eluder dimension
Definition 8: ABE dimension
...and 32 more
