Rate-Optimal Policy Optimization for Linear Markov Decision Processes
Uri Sherman, Alon Cohen, Tomer Koren, Yishay Mansour
TL;DR
The paper addresses regret minimization in online episodic linear MDPs with function approximation. It proposes an optimistic policy optimization algorithm that applies in two settings: adversarial losses with full-information feedback, and stochastic losses with bandit feedback. A reward-free warmup phase and a restricted value-function design control the capacity of the policy class and enable tight uniform concentration, which yields the $\widetilde{O}(\sqrt{K})$ regret rate. The main theoretical result is a regret bound of $\widetilde{O}(d^2 H^{7/2} \sqrt{K \log A})$, where $d$ is the feature dimension, $H$ the horizon, $K$ the number of episodes, and $A$ the number of actions, with data-efficiency parameterization and a computationally feasible procedure. The work advances the understanding of rate-optimal learning in linear MDPs and gives a theoretically grounded policy optimization approach for both adversarial and stochastic losses.
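To make the scaling of the stated bound concrete, the following minimal sketch evaluates only its leading term, $d^2 H^{7/2} \sqrt{K \log A}$. It assumes the usual reading of the symbols ($d$ feature dimension, $H$ horizon, $K$ episodes, $A$ number of actions), drops the constants and polylogarithmic factors hidden by the $\widetilde{O}$ notation, and uses hypothetical parameter values. The function name `regret_bound_leading_term` is introduced here for illustration and does not come from the paper.

```python
import math


def regret_bound_leading_term(d: int, H: int, K: int, A: int) -> float:
    """Leading term d^2 * H^(7/2) * sqrt(K * log A) of the stated bound.

    Constants and polylog factors hidden by the O-tilde are omitted,
    so the value is only meaningful for comparing scalings.
    """
    return d**2 * H**3.5 * math.sqrt(K * math.log(A))


# Hypothetical problem sizes, chosen only for illustration.
d, H, A = 4, 5, 3

for K in (10_000, 40_000):
    bound = regret_bound_leading_term(d, H, K, A)
    # Dividing by K gives the average per-episode regret implied by the bound.
    print(f"K={K:>6}  leading term={bound:.1f}  per-episode={bound / K:.4f}")
```

Quadrupling $K$ doubles the leading term and halves the per-episode value, which is the $\sqrt{K}$ behavior the summary calls rate-optimal.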
Abstract
We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal $\widetilde O (\sqrt K)$ regret where $K$ denotes the number of episodes. Our work is the first to establish the optimal (w.r.t. $K$) rate of convergence in the stochastic setting with bandit feedback using a policy-optimization-based approach, and the first to establish the optimal (w.r.t. $K$) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee is currently known.
