Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path
Qiwei Di, Jiafan He, Dongruo Zhou, Quanquan Gu
TL;DR
This work tackles learning in stochastic shortest path (SSP) problems with linear mixture transitions by introducing LEVIS$^{++}$, a computationally efficient algorithm that combines extended value iteration (DEVI) with a variance-aware confidence framework based on high-order moments (HOME). The method requires neither a strictly positive lower bound on the cost nor a known upper bound on the expected length of the optimal policy, and achieves a regret of $\tilde{O}(d B_* \sqrt{K})$. This matches the known lower bound $\Omega(d B_* \sqrt{K})$ up to logarithmic factors, so the algorithm is nearly minimax optimal. The key innovations are variance-aware weights that combine the estimated variance with uncertainty terms, and a multi-level moment estimator that propagates high-order information through weighted regressions. Together, these techniques yield horizon-free regret bounds in linear mixture SSPs, with practical implications for scalable RL in large state-action spaces.
Abstract
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel, where an agent repeatedly interacts with a stochastic environment and seeks to reach a certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy. In this paper, we propose a new algorithm that eliminates these restrictive assumptions. Our algorithm is based on extended value iteration with a fine-grained variance-aware confidence set, where the variance is estimated recursively from high-order moments. Our algorithm achieves an $\tilde{\mathcal O}(dB_*\sqrt{K})$ regret bound, where $d$ is the dimension of the feature mapping in the linear transition kernel, $B_*$ is the upper bound of the total cumulative cost for the optimal policy, and $K$ is the number of episodes. Our regret upper bound matches the $\Omega(dB_*\sqrt{K})$ lower bound for linear mixture SSPs in Min et al. (2022), which suggests that our algorithm is nearly minimax optimal.
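For readers unfamiliar with the setting, the quantities in the bound can be written out as follows. This is a sketch of the standard linear mixture SSP formulation, not a quotation from the paper: the symbols $\phi$, $\theta^*$, $c$, $V^*$, $s_{\text{init}}$, and $I_k$ (the number of steps in episode $k$) are notational assumptions introduced here for illustration.

$$
\mathbb{P}(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle, \qquad \phi(s' \mid s, a) \in \mathbb{R}^d, \quad \theta^* \in \mathbb{R}^d,
$$

$$
\mathrm{Regret}(K) = \sum_{k=1}^{K} \sum_{i=1}^{I_k} c(s_{k,i}, a_{k,i}) - K \cdot V^*(s_{\text{init}}), \qquad B_* \ge \max_{s} V^*(s).
$$

Under this reading, the regret compares the total cost the agent incurs over $K$ episodes with $K$ times the optimal expected cost from the initial state, and the stated result is $\mathrm{Regret}(K) = \tilde{\mathcal O}(dB_*\sqrt{K})$.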
