Horizon-Free Regret for Linear Markov Decision Processes

Zihan Zhang, Jason D. Lee, Yuxin Chen, Simon S. Du

TL;DR

This work addresses horizon-free regret in reinforcement learning under the linear MDP model, where the transition model may be exponentially large or uncountable. Rather than estimating the transition model, it directly learns the value functions and their confidence sets, which enables horizon-free guarantees in the linear setting. The authors introduce three core techniques (uniform variance bounds, total-variation controls, and doubling-segment horizon partitioning) to handle time-inhomogeneous value functions and variance-aware estimators, culminating in a regret bound of the form $\tilde{O}(\text{poly}(d)\sqrt{K})$ with poly-logarithmic dependence on the horizon $H$. The Horizon-Free Estimator (HF-Estimator) combines weighted least-squares value-function estimation with confidence sets for the transition and reward parameters (the latter via VOFUL), offering a principled route to horizon-free sample complexity in expressive linear MDPs and shedding light on the statistical efficiency of horizon-independent RL.

Abstract

A recent line of work showed that regret bounds in reinforcement learning (RL) can be (nearly) independent of the planning horizon, a.k.a. horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency on the size of the transition model is allowed, such as the tabular Markov Decision Process (MDP) and the linear mixture MDP. We give the first horizon-free bound for the popular linear MDP setting, where the size of the transition model can be exponentially large or even uncountable. In contrast to prior works, which explicitly estimate the transition model and compute the inhomogeneous value functions at different time steps, we directly estimate the value functions and confidence sets. We obtain the horizon-free bound by: (1) maintaining multiple weighted least-squares estimators for the value functions; and (2) a structural lemma which shows that the maximal total variation of the inhomogeneous value functions is bounded by a polynomial factor of the feature dimension.
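The weighted least-squares estimators mentioned in the abstract can be illustrated with a generic variance-weighted ridge regression. The sketch below is not the paper's HF-Estimator: the function names `weighted_ridge` and `confidence_width`, the parameters `lam` and `beta`, and the synthetic data are all assumptions made for illustration, and the per-sample variances are taken as known rather than estimated.

```python
import numpy as np

def weighted_ridge(features, targets, variances, lam=1.0):
    """Variance-weighted ridge regression (illustrative, hypothetical helper).

    Solves  argmin_w  sum_i (phi_i^T w - y_i)^2 / sigma_i^2 + lam * ||w||^2
    and returns the estimate together with the weighted Gram matrix.
    """
    phi = np.asarray(features, dtype=float)
    y = np.asarray(targets, dtype=float)
    inv_var = 1.0 / np.asarray(variances, dtype=float)
    d = phi.shape[1]
    weighted_phi = phi * inv_var[:, None]      # each row scaled by 1 / sigma_i^2
    gram = lam * np.eye(d) + weighted_phi.T @ phi
    rhs = weighted_phi.T @ y
    w_hat = np.linalg.solve(gram, rhs)
    return w_hat, gram

def confidence_width(phi_query, gram, beta):
    """Elliptical width beta * ||phi_query||_{gram^{-1}} (beta is an assumed radius)."""
    return beta * np.sqrt(phi_query @ np.linalg.solve(gram, phi_query))

# Synthetic heteroscedastic data; every value here is made up for the example.
rng = np.random.default_rng(0)
d, n = 3, 200
w_true = np.array([0.5, -0.2, 0.1])
phi = rng.normal(size=(n, d))
sigma2 = rng.uniform(0.01, 1.0, size=n)        # per-sample noise variances
y = phi @ w_true + rng.normal(size=n) * np.sqrt(sigma2)

w_hat, gram = weighted_ridge(phi, y, sigma2, lam=1.0)
width = confidence_width(phi[0], gram, beta=1.0)
```

Down-weighting high-variance samples is what makes such an estimator variance-aware: low-noise observations contribute more to the Gram matrix, so the confidence width shrinks faster in directions where the data are reliable.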

Paper Structure

This paper contains 31 sections, 17 theorems, 86 equations, 3 algorithms.

Key Result

Theorem 1

Choose $\mathtt{Reward-Confidence}$ as VOFUL (see Algorithm alg:voful). For any MDP satisfying the total-bounded reward assumption (Assumption assumr) and the linear MDP assumption (Assumption assuml), with probability $1-\delta$, the regret of Algorithm alg:main is bounded by $\widetilde{O}(d^{\ldots}\,\ldots)$
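
For orientation, the two assumptions named in the theorem are commonly written as below. This is a sketch in standard notation for linear MDPs and bounded total reward; the symbols $\phi$, $\mu$, $\theta$, and $r_h$ are assumed here and may differ from the paper's own statements of Assumption assuml and Assumption assumr.

```latex
% Linear MDP (standard form): transitions and rewards are linear
% in a known d-dimensional feature map \phi.
P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle, \qquad
R(s,a) = \langle \phi(s,a), \theta \rangle, \qquad
\phi(s,a) \in \mathbb{R}^d .

% Total-bounded reward: rewards are nonnegative and, along any
% trajectory of length H, sum to at most 1.
r_h \ge 0, \qquad \sum_{h=1}^{H} r_h \le 1 .
```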

Theorems & Definitions (28)

  • Theorem 1
  • Example 1
  • Lemma 1: Theorem 4.3 in zhou2022computationally
  • Lemma 2: chen2021implicit
  • Lemma 3
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6
  • ...and 18 more