A Computationally Efficient Algorithm for Infinite-Horizon Average-Reward Linear MDPs
Kihyuk Hong, Ambuj Tewari
TL;DR
The paper tackles learning in infinite-horizon average-reward RL for linear MDPs by addressing the computational bottleneck of clipping-based value iteration, which normally requires minimizing over the entire state space. It introduces Efficient Clipping, which confines clipping thresholds to states visited by the agent, and Deviation-Controlled Value Iteration to stabilize value-function sequences across varying thresholds. The resulting algorithm, γ-DC-LSCVI-UCB+, achieves a regret bound of $R_T = \tilde{O}(\text{sp}(v^*) \sqrt{d^3 T})$ with computational complexity polynomial in $T$, $d$, and $A$ that is independent of the state space size $|\mathcal{S}|$. The analysis combines discounted-approximation insights, clipped-VI techniques, and deviation control to obtain tight regret bounds while maintaining scalability to large or infinite state spaces. This work thus enables practical, near-optimal learning in average-reward linear MDPs and suggests pathways to broader function-approximation settings and variance-aware improvements.
Abstract
We study reinforcement learning in infinite-horizon average-reward settings with linear MDPs. Previous work addresses this problem by approximating the average-reward setting with a discounted setting and employing a value iteration-based algorithm that uses clipping to constrain the span of the value function for improved statistical efficiency. However, the clipping procedure requires computing the minimum of the value function over the entire state space, which is prohibitive since the state space in the linear MDP setting can be large or even infinite. In this paper, we introduce a value iteration method with an efficient clipping operation that only requires computing the minimum of value functions over the set of states visited by the algorithm. Our algorithm enjoys the same regret bound as the previous work while being computationally efficient, with computational complexity that is independent of the size of the state space.
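To make the clipping idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the value estimates are available as plain arrays and that clipping means restricting values to an interval of width `span_bound` whose lower end is the minimum over visited states. The function name `clip_to_visited_min` and the parameter `span_bound` are hypothetical, and the deviation-control step is omitted entirely.

```python
import numpy as np


def clip_to_visited_min(values, visited_values, span_bound):
    """Clip value estimates to [m, m + span_bound].

    m is the minimum of the value estimates over *visited* states only
    (an illustrative stand-in for the efficient clipping threshold), so
    the cost depends on the number of visited states rather than on the
    size of the full state space.
    """
    values = np.asarray(values, dtype=float)
    visited_values = np.asarray(visited_values, dtype=float)
    if visited_values.size == 0:
        raise ValueError("at least one visited state is required")
    lower = float(visited_values.min())
    return np.clip(values, lower, lower + span_bound)


# Value estimates at three query states, and at the two states visited so far.
query_values = [0.5, 2.0, 7.0]
visited = [1.0, 3.0]
clipped = clip_to_visited_min(query_values, visited, span_bound=4.0)
print(clipped)  # [1. 2. 5.]
```

In this sketch the threshold is computed from at most as many values as there are visited states, which is bounded by the number of steps taken, whereas a minimum over the whole state space may not be computable at all when that space is infinite.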
