Rate-Optimal Policy Optimization for Linear Markov Decision Processes

Uri Sherman; Alon Cohen; Tomer Koren; Yishay Mansour

TL;DR

The paper addresses regret minimization in online episodic linear MDPs with function approximation by proposing an optimistic policy optimization algorithm that works in both adversarial full-information and stochastic bandit feedback settings. A reward-free warmup and restricted value-function design control policy-class capacity and enable tight uniform concentration, leading to a rate of $\tilde{O}(\sqrt{K})$ regret. The main theoretical contribution is a concrete regret bound of $\tilde{O}( d^2 H^{7/2} \sqrt{K \log A} )$, with data-efficiency parameterization and a computationally feasible procedure. This work advances understanding of rate-optimal learning in linear MDPs and provides a practical, theoretically grounded approach for policy optimization under both adversarial and stochastic losses.
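To make the $\sqrt{K}$ rate concrete, the sketch below evaluates only the leading term $d^2 H^{7/2} \sqrt{K \log A}$ of the stated bound. Constants and the logarithmic factors hidden by $\tilde{O}$ are dropped, and the values of `d`, `H`, and `A` are hypothetical, chosen purely for illustration.

```python
import math


def regret_leading_term(d: int, H: int, K: int, A: int) -> float:
    """Leading term d^2 * H^(7/2) * sqrt(K * log A) of the stated regret bound.

    Constants and logarithmic factors hidden by the O-tilde notation are dropped,
    so this is a scaling illustration, not the bound itself.
    """
    return d**2 * H**3.5 * math.sqrt(K * math.log(A))


# Hypothetical problem sizes (assumptions, not values from the paper).
d, H, A = 4, 5, 10

for K in (10**4, 10**6, 10**8):
    total = regret_leading_term(d, H, K, A)
    # Dividing by K gives the average per-episode term, which shrinks like 1/sqrt(K).
    print(f"K={K:>9}  leading term={total:.3e}  per episode={total / K:.3e}")
```

Multiplying $K$ by 100 multiplies the leading term by 10, so the per-episode average falls by a factor of 10; this is the sense in which $\sqrt{K}$ regret is sublinear.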

Abstract

We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal $\widetilde O (\sqrt K)$ regret, where $K$ denotes the number of episodes. Our work is the first to establish the optimal (w.r.t. $K$) rate of convergence in the stochastic setting with bandit feedback using a policy-optimization-based approach, and the first to establish the optimal (w.r.t. $K$) rate in the adversarial setup with full-information feedback, for which no algorithm with an optimal rate guarantee is currently known.

Paper Structure (17 sections, 3 theorems, 134 equations, 3 algorithms)

Introduction
Summary of contributions
Overview of techniques
Additional related work
Linear MDPs with adversarial costs.
Policy optimization in tabular and linear MDPs.
RL with function approximation
Preliminaries
Episodic MDPs.
Episodic Linear MDPs.
Problem setup.
Learning objective.
Occupancy measures.
Additional notation.
Algorithm and Main Result
...and 2 more sections

Key Result

Theorem 1

Let $\delta > 0$, assume $K \geq H^5 d^4 \log^8(dHK/\delta)$, $H \geq 3$, $\log A \leq K$, and consider setting $\beta = 2 c_\beta d^{3/2} H \log(dHK/\delta)$, where $c_\beta$ is specified by the good-event lemma (lem:good_event), $\epsilon_{\rm cov} = H^{3/2} d^2 \log^4(dHK/\delta)/\sqrt K$, and $\eta = \sqrt{\log A}\,\cdots$ ... where big-$O$ hides only constant factors independent of problem parameters.
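The parameter choices in Theorem 1 can be evaluated numerically. The following is a minimal sketch that computes $\beta$ and $\epsilon_{\rm cov}$ and checks the three stated assumptions. The constant $c_\beta$ is not given in this summary, so `c_beta=1.0` is a placeholder assumption; the step size $\eta$ is omitted because its expression is incomplete above. The function name and the example inputs are hypothetical.

```python
import math


def theorem1_parameters(d: int, H: int, K: int, A: int, delta: float,
                        c_beta: float = 1.0) -> dict:
    """Evaluate the parameter settings and assumptions stated in Theorem 1.

    c_beta is a placeholder: the theorem defers its value to a lemma
    that is not reproduced here.
    """
    log_term = math.log(d * H * K / delta)

    # beta = 2 * c_beta * d^(3/2) * H * log(dHK / delta)
    beta = 2 * c_beta * d**1.5 * H * log_term
    # eps_cov = H^(3/2) * d^2 * log^4(dHK / delta) / sqrt(K)
    eps_cov = H**1.5 * d**2 * log_term**4 / math.sqrt(K)

    # Right-hand side of the assumption K >= H^5 * d^4 * log^8(dHK / delta).
    k_threshold = H**5 * d**4 * log_term**8
    assumptions_hold = (K >= k_threshold) and (H >= 3) and (math.log(A) <= K)

    return {
        "beta": beta,
        "eps_cov": eps_cov,
        "k_threshold": k_threshold,
        "assumptions_hold": assumptions_hold,
    }


# Hypothetical inputs, chosen only to exercise the formulas.
for K in (10**3, 10**18):
    params = theorem1_parameters(d=2, H=3, K=K, A=4, delta=0.1)
    print(K, params)
```

Because the threshold on $K$ scales with $\log^8(dHK/\delta)$, the condition only holds for very large $K$ even in small problems, which the two example values illustrate.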

Theorems & Definitions (4)

Definition 1: Linear MDP
Theorem 1
Lemma 1
Lemma 2
