Rate-Optimal Policy Optimization for Linear Markov Decision Processes

Uri Sherman, Alon Cohen, Tomer Koren, Yishay Mansour

TL;DR

The paper studies regret minimization in online episodic linear MDPs with function approximation and proposes an optimistic policy optimization algorithm that applies in both the adversarial full-information setting and the stochastic bandit-feedback setting. A reward-free warmup phase and a restricted value-function design control the capacity of the policy class and enable tight uniform concentration, which yields $\tilde{O}(\sqrt{K})$ regret. The main theoretical result is the explicit regret bound $\tilde{O}(d^2 H^{7/2} \sqrt{K \log A})$, together with a data-efficiency parameterization and a computationally feasible procedure. The work thereby establishes rate-optimal (in $K$) learning in linear MDPs through policy optimization, under both adversarial and stochastic losses.
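
To make the term "policy optimization" concrete, here is a minimal sketch of the generic exponential-weights (online mirror descent) policy update for a single state. It is a textbook form, not the paper's algorithm: the optimistic bonuses, the reward-free warmup, and the restricted value-function class described above are all omitted. The function name, the argument names, and the use of estimated losses-to-go as `q_values` are assumptions made for illustration.

```python
import numpy as np

def exponential_weights_update(policy, q_values, eta):
    """One exponential-weights step at a single state (illustrative only).

    policy   : shape (A,), current action probabilities, assumed strictly positive
    q_values : shape (A,), estimated loss-to-go per action (lower is better)
    eta      : step size (learning rate)
    """
    # Work in log space: log pi_new(a) = log pi(a) - eta * Q(a) + const.
    logits = np.log(policy) - eta * q_values
    # Subtract the maximum for numerical stability before exponentiating.
    logits = logits - logits.max()
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Hypothetical usage: three actions, uniform start, action 1 has the lowest loss.
uniform = np.ones(3) / 3
updated = exponential_weights_update(uniform, np.array([1.0, 0.0, 2.0]), eta=0.5)
print(updated)
```

Updates of this kind typically contribute a $\sqrt{\log A}$ factor to regret bounds, which matches the dependence on $A$ in the bound quoted above.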

Abstract

We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal $\widetilde{O}(\sqrt{K})$ regret, where $K$ denotes the number of episodes. Our work is the first to establish the optimal (w.r.t. $K$) rate of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal (w.r.t. $K$) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee is currently known.

Paper Structure (17 sections, 3 theorems, 134 equations, 3 algorithms)

Key Result

Theorem 1

Let $\delta > 0$, assume $K \geq H^5 d^4 \log^8(dHK/\delta)$, $H \geq 3$, and $\log A \leq K$, and consider setting $\beta = 2 c_\beta d^{3/2} H \log(dHK/\delta)$, where $c_\beta$ is specified by lem:good_event, $\epsilon_{\rm cov} = H^{3/2} d^2 \log^4(dHK/\delta)/\sqrt{K}$, and $\eta = \sqrt{\log A}\,[\ldots]$ where big-$O$ hides only constant factors independent of problem parameters.
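
As a rough aid for reading the theorem, the sketch below evaluates $\beta$ and $\epsilon_{\rm cov}$ from the formulas stated above and checks the three stated assumptions for given problem sizes. The function name is hypothetical, and the default `c_beta=1.0` is only a placeholder because the constant $c_\beta$ is not given here. The step size $\eta$ is left out because its expression is incomplete in the statement.

```python
import math

def theorem1_parameters(d, H, K, A, delta, c_beta=1.0):
    """Evaluate the parameter settings and assumptions of Theorem 1 (sketch).

    c_beta is an unknown absolute constant in the source; 1.0 is a placeholder.
    """
    log_term = math.log(d * H * K / delta)
    # beta = 2 * c_beta * d^{3/2} * H * log(dHK/delta)
    beta = 2 * c_beta * d**1.5 * H * log_term
    # eps_cov = H^{3/2} * d^2 * log^4(dHK/delta) / sqrt(K)
    eps_cov = H**1.5 * d**2 * log_term**4 / math.sqrt(K)
    # The three assumptions listed in the theorem statement.
    conditions = {
        "K >= H^5 d^4 log^8(dHK/delta)": K >= H**5 * d**4 * log_term**8,
        "H >= 3": H >= 3,
        "log A <= K": math.log(A) <= K,
    }
    return beta, eps_cov, conditions

# Hypothetical problem sizes; the lower bound on K is demanding even for small d and H.
beta, eps_cov, conditions = theorem1_parameters(d=2, H=3, K=10**20, A=10, delta=0.1)
print(beta, eps_cov, conditions)
```

Trying small values of $K$ in this sketch shows that the condition $K \geq H^5 d^4 \log^8(dHK/\delta)$ only holds once the number of episodes is very large relative to $d$ and $H$.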

Theorems & Definitions (4)

  • Definition 1: Linear MDP
  • Theorem 1
  • Lemma 1
  • Lemma 2
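
For orientation, the linear MDP model named in Definition 1 is usually stated in the literature roughly as follows. This is the standard textbook formulation, given here as an assumption about what Definition 1 contains; the symbols $\psi_h$ and $g_h$ are illustrative, and the paper's exact notation and normalization may differ. There is a known feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ such that, for every step $h = 1, \dots, H$,

$$
P_h(s' \mid s, a) = \langle \phi(s, a), \psi_h(s') \rangle,
\qquad
\ell_h(s, a) = \langle \phi(s, a), g_h \rangle,
$$

where $\psi_h$ is an unknown vector of $d$ signed measures over states and $g_h \in \mathbb{R}^d$ is an unknown loss vector. In the adversarial setting the loss vector may additionally change from episode to episode.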