Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs
Junkai Zhang, Weitong Zhang, Quanquan Gu
TL;DR
This work tackles reward-free reinforcement learning with linear function approximation by introducing HF-UCRL-RFE++, a horizon-free algorithm for linear mixture MDPs. It combines exploration-guided pseudo rewards, high-order moment estimation (HOME), and a high-confidence set to learn a near-optimal transition model from exploration samples, so that planning can be performed with any reward function given afterwards. The main results are a horizon-free upper bound of $\tilde{O}(d^2 \varepsilon^{-2})$ exploration episodes sufficient to return an $\varepsilon$-optimal policy in the planning phase, and a matching $\Omega(d^2 \varepsilon^{-2})$ lower bound. When the total reward is instead scaled to sum to $H$, the upper bound becomes $\tilde{O}(H^2 d^2 \varepsilon^{-2})$, which corresponds to replacing $\varepsilon$ with $\varepsilon/H$ in the normalized bound. Together these results show optimality up to logarithmic factors and that the horizon length $H$ enters the sample complexity only polylogarithmically for horizon-free reward-free exploration in linear mixture MDPs.
Abstract
We study reward-free reinforcement learning (RL) with linear function approximation, where the agent works in two phases: (1) in the exploration phase, the agent interacts with the environment but cannot access the reward; and (2) in the planning phase, the agent is given a reward function and is expected to find a near-optimal policy based on samples collected in the exploration phase. The sample complexities of existing reward-free algorithms have a polynomial dependence on the planning horizon, which makes them intractable for long planning horizon RL problems. In this paper, we propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs), where the transition probability can be parameterized as a linear combination of known feature mappings. At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward and a high-order moment estimator for the aleatoric and epistemic uncertainties. When the total reward is bounded by $1$, we show that our algorithm only needs to explore $\tilde O( d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy, where $d$ is the dimension of the feature mapping. The sample complexity of our algorithm only has a polylogarithmic dependence on the planning horizon and therefore is "horizon-free". In addition, we provide an $\Omega(d^2\varepsilon^{-2})$ sample complexity lower bound, which matches the sample complexity of our algorithm up to logarithmic factors, suggesting that our algorithm is optimal.
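To make the linear mixture structure and the idea of weighted value-targeted regression concrete, the following is a minimal Python sketch, not the paper's HF-UCRL-RFE++ algorithm. It assumes a small tabular MDP whose transition kernel is a convex combination of $d$ known base kernels, and it estimates the mixing parameter by ridge regression on value targets, weighting each sample by an inverse variance. All names (`make_mixture_mdp`, `weighted_vtr`, `var_floor`) are hypothetical. For simplicity the variance is computed from the true model, whereas the paper estimates it with a high-order moment estimator, and the pseudo-reward and confidence set are omitted entirely.

```python
import numpy as np


def make_mixture_mdp(n_states, n_actions, d, rng):
    """Build d known base kernels and a hidden mixing vector theta (illustrative)."""
    # base[i, s, a, :] is a probability distribution over next states.
    base = rng.dirichlet(np.ones(n_states), size=(d, n_states, n_actions))
    # theta lies on the simplex, so sum_i theta_i * base[i] is a valid kernel.
    theta = rng.dirichlet(np.ones(d))
    return base, theta


def weighted_vtr(base, theta, V, n_samples, lam, rng, var_floor=1e-2):
    """Estimate theta by variance-weighted ridge regression on targets V(s')."""
    d, n_states, n_actions, _ = base.shape
    # True kernel P(s' | s, a) = <phi(s' | s, a), theta>.
    P = np.einsum("i,isan->san", theta, base)
    # Value-targeted feature: phi_V(s, a) = sum_{s'} phi(s' | s, a) V(s').
    phi_V = np.einsum("isan,n->sai", base, V)

    Sigma = lam * np.eye(d)  # regularized, weighted Gram matrix
    b = np.zeros(d)
    for _ in range(n_samples):
        s, a = rng.integers(n_states), rng.integers(n_actions)
        s_next = rng.choice(n_states, p=P[s, a])
        x = phi_V[s, a]
        # ASSUMPTION: oracle conditional variance of V(s'); a real algorithm
        # must estimate this quantity from data.
        var = P[s, a] @ V**2 - (x @ theta) ** 2
        w = 1.0 / max(var, var_floor)
        Sigma += w * np.outer(x, x)
        b += w * V[s_next] * x
    return np.linalg.solve(Sigma, b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base, theta = make_mixture_mdp(n_states=5, n_actions=3, d=3, rng=rng)
    V = rng.uniform(0.0, 1.0, size=5)  # an arbitrary value function in [0, 1]
    theta_hat = weighted_vtr(base, theta, V, n_samples=5000, lam=1.0, rng=rng)
    print("true theta:     ", theta)
    print("estimated theta:", theta_hat)
```

Down-weighting high-variance samples is what lets variance-aware analyses avoid paying for the worst-case range of the value function at every step. The sketch only illustrates the regression step under a fixed value function and uniform state-action sampling.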
