Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

Jianliang He, Han Zhong, Zhuoran Yang

TL;DR

This work advances sample-efficient reinforcement learning for infinite-horizon average-reward MDPs (AMDPs) under general function approximation by introducing a unifying complexity measure, AGEC, and a flexible algorithmic framework, LOOP. LOOP combines optimistic planning with a confidence-set construction and a lazy (low-switching) policy-update rule, and it admits both model-based and value-based instantiations. The authors show a sublinear regret bound of $\tilde{\mathcal{O}}(\mathrm{sp}(V^*) \cdot d \sqrt{T\beta})$, where $d$ is the AGEC, $\beta$ is the log-covering number of the hypothesis class, $\mathrm{sp}(V^*)$ is the span of the optimal state bias function, and $T$ is the number of steps. Low AGEC covers many tractable AMDP models, including linear, kernel, and linear $Q^*/V^*$ variants. The framework gives a unified theoretical basis for understanding exploration in AMDPs with general function approximation and suggests applicability to high-dimensional decision problems.

Abstract

We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP), which incorporates both model-based and value-based incarnations. In particular, LOOP features a novel construction of confidence sets and a low-switching policy updating scheme, which are tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure -- average-reward generalized eluder coefficient (AGEC) -- which captures the challenge of exploration in AMDPs with general function approximation. Such a complexity measure encompasses almost all previously known tractable AMDP models, such as linear AMDPs and linear mixture AMDPs, and also includes newly identified cases such as kernel AMDPs and AMDPs with Bellman eluder dimensions. Using AGEC, we prove that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*)) \sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to AGEC and log-covering number of the hypothesis class respectively, $\mathrm{sp}(V^*)$ is the span of the optimal state bias function, $T$ denotes the number of steps, and $\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors. When specialized to concrete AMDP models, our regret bounds are comparable to those established by the existing algorithms designed specifically for these special cases. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs.
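
To make the quantities in the regret bound concrete, the following is a minimal sketch, not the LOOP algorithm itself. It assumes the model is known and uses classical relative value iteration to compute the optimal average reward $J^*$ and a bias function $V^*$ satisfying the average-reward Bellman optimality equation $J^* + V^*(s) = \max_a \{ r(s,a) + \mathbb{E}_{s'}[V^*(s')] \}$, then evaluates the span $\mathrm{sp}(V^*) = \max_s V^*(s) - \min_s V^*(s)$ and the usual average-reward regret $\sum_{t=1}^{T} (J^* - r(s_t, a_t))$. The two-state MDP and the function names `relative_value_iteration`, `span`, and `regret` are hypothetical and chosen only for illustration.

```python
import numpy as np


def relative_value_iteration(P, r, ref_state=0, tol=1e-10, max_iter=10_000):
    """Classical planning routine for a known average-reward MDP (not LOOP).

    P: transition tensor of shape (S, A, S); r: reward matrix of shape (S, A).
    Returns (J, V): the optimal average reward and a bias function
    normalized so that V[ref_state] == 0.
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Bellman optimality backup: max over actions of r + E[V(s')].
        TV = (r + P @ V).max(axis=1)
        J = TV[ref_state]      # running estimate of the optimal gain
        V_new = TV - J         # subtract the gain to keep iterates bounded
        if np.max(np.abs(V_new - V)) < tol:
            return J, V_new
        V = V_new
    raise RuntimeError("relative value iteration did not converge")


def span(V):
    """Span of a bias function: max_s V(s) - min_s V(s)."""
    return V.max() - V.min()


def regret(J_star, rewards):
    """Average-reward regret of a reward sequence: sum_t (J* - r_t)."""
    return sum(J_star - x for x in rewards)


# Hypothetical toy MDP: 2 states, 2 actions (0 = "stay", 1 = "move").
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # transitions from state 0
    [[0.1, 0.9], [0.9, 0.1]],   # transitions from state 1
])
r = np.array([
    [0.2, 0.0],                 # rewards in state 0
    [1.0, 0.0],                 # rewards in state 1
])

J_star, V_star = relative_value_iteration(P, r)
print("J* =", J_star, " sp(V*) =", span(V_star))
```

For this toy MDP the optimal policy moves out of state 0 and stays in state 1, which gives $J^* = 0.9$ and $\mathrm{sp}(V^*) = 1$ when solved by hand. The span is the quantity that replaces the horizon or discount-based scale in the regret bound above; a learning algorithm such as LOOP does not have access to `P` and `r` and must estimate them (or the value functions) from data.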

Paper Structure

This paper contains 62 sections, 24 theorems, 136 equations, 1 table, and 3 algorithms.

Key Result

Lemma 3.1

Consider the discrepancy function with $\mathcal{P}(f)=f^*$, where the expectation is taken over $s_{t+1}$ drawn from $\mathbb{P}(\cdot|s_t,a_t)$. Let $d_{\rm E}={\rm dim_E}(\mathcal{X}_\mathcal{H},\epsilon)$ be the $\epsilon$-Eluder dimension defined over $\mathcal{X}_\mathcal{H}$. Then we have $d_\text{G}\leq2d_{\rm E}\cdot\log T$ and $\k

Theorems & Definitions (42)

  • Definition 1: Value-based hypothesis
  • Definition 2: Model-based hypothesis
  • Definition 3: AGEC
  • Example 1: Bellman completeness $\subseteq$ Generalized completeness
  • Definition 4: Point-wise $\epsilon$-independence
  • Definition 5: Eluder dimension
  • Lemma 3.1: Low Eluder dim $\subseteq$ Low AGEC
  • Definition 6: Distributional $\epsilon$-independence
  • Definition 7: Distributional Eluder dimension
  • Definition 8: ABE dimension
  • ...and 32 more