Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs

Junkai Zhang; Weitong Zhang; Quanquan Gu

TL;DR

This work tackles reward-free reinforcement learning with linear function approximation by introducing HF-UCRL-RFE++, a horizon-free algorithm for linear mixture MDPs. It combines exploration-guided pseudo rewards, high-order moment estimation (HOME), and a high-confidence set to learn a near-optimal transition model from exploration samples, so that planning can be performed for any reward function given afterwards. The main results establish a horizon-free upper bound of $\tilde{O}(d^2 \varepsilon^{-2})$ exploration episodes to guarantee an $\varepsilon$-optimal policy in the planning phase, and a matching $\Omega(d^2 \varepsilon^{-2})$ lower bound, with a variant $\tilde{O}(H^2 d^2 \varepsilon^{-2})$ when rewards are scaled to sum to $H$. The algorithm is therefore optimal up to logarithmic factors, and the horizon length $H$ enters the sample complexity only polylogarithmically.
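
The $H^2$ factor in the rescaled variant is consistent with a standard normalization argument, sketched here under the assumption that the total reward per episode is at most $H$. The symbols $r'$, $V'$, and $\varepsilon'$ are introduced only for this illustration and do not appear in the source.

$$
r' = r/H \;\Longrightarrow\; V'^{\pi} = V^{\pi}/H, \qquad V^{*} - V^{\pi} \le \varepsilon \iff V'^{*} - V'^{\pi} \le \varepsilon' := \varepsilon/H,
$$

$$
K = \tilde{O}\big(d^2 \varepsilon'^{-2}\big) = \tilde{O}\big(H^2 d^2 \varepsilon^{-2}\big).
$$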

Abstract

We study reward-free reinforcement learning (RL) with linear function approximation, where the agent works in two phases: (1) in the exploration phase, the agent interacts with the environment but cannot access the reward; and (2) in the planning phase, the agent is given a reward function and is expected to find a near-optimal policy based on samples collected in the exploration phase. The sample complexities of existing reward-free algorithms have a polynomial dependence on the planning horizon, which makes them intractable for long planning horizon RL problems. In this paper, we propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs), where the transition probability can be parameterized as a linear combination of known feature mappings. At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward and a high-order moment estimator for the aleatoric and epistemic uncertainties. When the total reward is bounded by $1$, we show that our algorithm only needs to explore $\tilde O( d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy, where $d$ is the dimension of the feature mapping. The sample complexity of our algorithm only has a polylogarithmic dependence on the planning horizon and therefore is "horizon-free". In addition, we provide an $\Omega(d^2\varepsilon^{-2})$ sample complexity lower bound, which matches the sample complexity of our algorithm up to logarithmic factors, suggesting that our algorithm is optimal.
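
To make the linear mixture structure concrete, the following is a minimal sketch of a small tabular instance in which the transition kernel is a convex combination of $d$ known base kernels. It is an illustration only, not the paper's algorithm: the function names, the Dirichlet sampling, and the choice of a simplex-valued parameter `theta` are all assumptions made for this example.

```python
import numpy as np


def make_linear_mixture_mdp(num_states, num_actions, d, seed=0):
    """Sample d base kernels and a mixing vector (both hypothetical)."""
    rng = np.random.default_rng(seed)
    # base_kernels[i, s, a, :] is a distribution over next states.
    base_kernels = rng.dirichlet(
        np.ones(num_states), size=(d, num_states, num_actions)
    )
    # theta lies on the simplex so the mixture is a valid kernel.
    theta = rng.dirichlet(np.ones(d))
    return base_kernels, theta


def transition_kernel(base_kernels, theta):
    """P(s'|s,a) = sum_i theta_i * P_i(s'|s,a), i.e. <phi(s'|s,a), theta>."""
    return np.einsum("i,isat->sat", theta, base_kernels)


def value_features(base_kernels, value):
    """phi_V(s,a)[i] = sum_{s'} P_i(s'|s,a) * V(s')."""
    return np.einsum("isat,t->sai", base_kernels, value)


base_kernels, theta = make_linear_mixture_mdp(num_states=4, num_actions=2, d=3)
P = transition_kernel(base_kernels, theta)
V = np.linspace(0.0, 1.0, 4)
# The expected next-state value is linear in theta: [PV](s,a) = <phi_V(s,a), theta>.
expected_next_value = value_features(base_kernels, V) @ theta
```

The last line shows why value-targeted regression is natural in this model: for any fixed value function, the expected next-state value is a linear function of the unknown parameter, so it can be estimated by regressing observed values on `phi_V`.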

Paper Structure (34 sections, 28 theorems, 108 equations, 1 figure, 2 tables)

This paper contains 34 sections, 28 theorems, 108 equations, 1 figure, 2 tables.

Introduction
Notation
Related Work
RL with Linear Function Approximation
Preliminaries
Reward-free RL
Algorithms
Exploration-driven Pseudo Value Function
High-order Moment Estimation
High Confidence Set
Planning Phase
Computational Complexity of HF-UCRL-RFE++
Main Results
Upper Bounds
Lower Bounds
...and 19 more sections

Key Result

Theorem 5.1

For Algorithm alg:exp, set $M = \log(7KH)/\log(2)$, $\alpha = H^{-1/2}$, $\gamma = d^{-1/4}$, $\lambda = d/B^2$, $\{\beta_k\}_{k\ge1}$ as and denote $\beta=\beta_K$, where $\eta = \log(1 + kH/(\alpha^2 d\lambda))$ and $\tau = \log(32(\log(\gamma^2/\alpha)+1)k^2H^2/\delta)$. Then, for any $0<\delta<1$, we have with probability at least $1-\delta$, after collecting $K$ episodes of samples, algorith
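
As a rough aid for reading the parameter choices above, the sketch below evaluates the quantities whose formulas are stated explicitly in the theorem ($M$, $\alpha$, $\gamma$, $\lambda$, $\eta$, $\tau$). The confidence radius $\beta_k$ is not computed because its definition is not reproduced here. The function name, the argument names, and the domain check are assumptions for this illustration.

```python
import math


def theorem_5_1_parameters(K, H, d, B, delta, k=None):
    """Evaluate the explicitly stated quantities; k defaults to K."""
    k = K if k is None else k
    M = math.log(7 * K * H) / math.log(2)  # M = log(7KH) / log(2)
    alpha = H ** -0.5                      # alpha = H^{-1/2}
    gamma = d ** -0.25                     # gamma = d^{-1/4}
    lam = d / B ** 2                       # lambda = d / B^2
    eta = math.log(1 + k * H / (alpha ** 2 * d * lam))
    inner = math.log(gamma ** 2 / alpha) + 1
    if inner <= 0:
        # tau is a logarithm, so its argument must be positive.
        raise ValueError("log(gamma^2 / alpha) + 1 must be positive")
    tau = math.log(32 * inner * k ** 2 * H ** 2 / delta)
    return {"M": M, "alpha": alpha, "gamma": gamma,
            "lambda": lam, "eta": eta, "tau": tau}


params = theorem_5_1_parameters(K=1000, H=100, d=10, B=1.0, delta=0.05)
```

With these choices $\gamma^2/\alpha = \sqrt{H/d}$, so the expression for $\tau$ as written is only defined when $\log\sqrt{H/d} + 1 > 0$; the check in the code makes that condition explicit.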

Figures (1)

Figure 1: The transition kernel of the hard-to-learn linear mixture MDPs.

Theorems & Definitions (33)

Definition 3.1: Linear Mixture MDPs (jia2020model; ayoub2020model; zhou2020provably)
Definition 3.3
Theorem 5.1
Corollary 5.2
Remark 5.3
Corollary 5.4
Remark 5.5
Theorem 5.6
Remark 5.7
Corollary 5.8
...and 23 more
