Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes

Asaf Cassel; Aviv Rosenberg

TL;DR

This work introduces Contracted Features Policy Optimization (CFPO), a contraction-based approach to policy optimization in linear MDPs that removes the need for reward-free warm-up phases. By defining contracted features and a contracted (sub) MDP, CFPO achieves rate-optimal regret in both adversarial full-information and stochastic bandit settings, with a bound of $O\left( \sqrt{K d^3 H^4 \log(K) \log(KH/\delta)} + \sqrt{K d H^5 \log(K) \log|\mathcal{A}|} \right)$. The analysis hinges on a novel regret decomposition and an elliptical-potential-based control of estimation errors under the contraction, yielding improved dependence on horizon $H$ and feature dimension $d$ relative to prior warm-up–dependent methods. The approach is practical, reward-aware, and computationally comparable to existing linear MDP PO algorithms, offering a meaningful advance for regret minimization under function approximation in RL.
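The elliptical-potential argument mentioned above can be illustrated numerically. The sketch below checks the standard elliptical potential lemma for regularized feature covariances, not the paper's contracted variant; the function name `elliptical_potential`, the regularizer `lam`, and the random unit-norm features are assumptions made only for illustration.

```python
import math
import random


def elliptical_potential(features, lam=1.0):
    """Return (sum_k min(1, ||phi_k||^2 in the Lambda_{k-1}^{-1} norm), 2 * log det ratio).

    Lambda_0 = lam * I and Lambda_k = Lambda_{k-1} + phi_k phi_k^T.
    """
    d = len(features[0])
    # inv tracks the inverse of the current covariance matrix Lambda.
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    potential = 0.0
    log_det_ratio = 0.0
    for phi in features:
        v = [sum(inv[i][j] * phi[j] for j in range(d)) for i in range(d)]  # inv @ phi
        width_sq = sum(phi[i] * v[i] for i in range(d))  # phi^T Lambda^{-1} phi
        potential += min(1.0, width_sq)
        # Matrix determinant lemma: det grows by a factor (1 + width_sq).
        log_det_ratio += math.log1p(width_sq)
        # Sherman-Morrison rank-one update of the inverse.
        scale = 1.0 / (1.0 + width_sq)
        for i in range(d):
            for j in range(d):
                inv[i][j] -= scale * v[i] * v[j]
    return potential, 2.0 * log_det_ratio


# Hypothetical data: K random unit-norm feature vectors in dimension d.
random.seed(0)
K, d, lam = 500, 4, 1.0
features = []
for _ in range(K):
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    features.append([x / norm for x in g])

potential, log_det_bound = elliptical_potential(features, lam)
worst_case = 2.0 * d * math.log(1.0 + K / (d * lam))
print(potential, log_det_bound, worst_case)
```

The summed (clipped) squared bonus widths stay below twice the log-determinant ratio, which in turn is at most $2d\log(1 + K/(d\lambda))$ for unit-norm features. This logarithmic growth in $K$ is what lets the accumulated estimation error scale as $\sqrt{K}$ up to logarithmic factors.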

Abstract

Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.

Paper Structure (34 sections, 25 theorems, 75 equations, 1 algorithm)

Introduction
Related work
Policy optimization in tabular MDPs.
Other regret minimization methods in tabular MDPs.
Policy optimization in linear MDPs.
Other regret minimization methods in linear MDPs and other models for function approximation.
Problem setup
Episodic Markov Decision Process (MDP).
Linear MDP.
Policy and value.
Interaction protocol and regret.
Notation.
The role of value clipping
Algorithm and main result
Discussion.
...and 19 more sections

Key Result

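Theorem 1

Suppose that we run CFPO (Algorithm 1) with the parameters defined in the full regret-bound theorem of the appendix analysis section. Then, with probability at least $1 - \delta$, the regret is $O\left( \sqrt{K d^3 H^4 \log(K) \log(KH/\delta)} + \sqrt{K d H^5 \log(K) \log|\mathcal{A}|} \right)$, which is the bound quoted in the TL;DR.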
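To get a feel for the regret bound quoted in the TL;DR, the following sketch evaluates its two terms with the constants hidden by $O(\cdot)$ set to 1. The function name `regret_bound_terms` and the parameter values are hypothetical and chosen only for illustration; the numbers are not actual regret values.

```python
import math


def regret_bound_terms(K, d, H, num_actions, delta):
    """Evaluate the two terms inside the O(.) bound, ignoring hidden constants.

    K: number of episodes, d: feature dimension, H: horizon,
    num_actions: |A|, delta: failure probability.
    """
    first = math.sqrt(K * d**3 * H**4 * math.log(K) * math.log(K * H / delta))
    second = math.sqrt(K * d * H**5 * math.log(K) * math.log(num_actions))
    return first, second


# Hypothetical problem sizes, used only to compare the two terms.
for K in (10_000, 1_000_000):
    first, second = regret_bound_terms(K=K, d=10, H=20, num_actions=5, delta=0.05)
    print(f"K={K}: first={first:.3e}, second={second:.3e}, "
          f"per-episode={(first + second) / K:.3e}")
```

Both terms grow as $\sqrt{K}$ up to logarithmic factors, so the per-episode average shrinks as $K$ grows. The ratio of the first term to the second is $\sqrt{d^2 \log(KH/\delta) / (H \log|\mathcal{A}|)}$, so the first term dominates whenever $d^2 \log(KH/\delta) \ge H \log|\mathcal{A}|$.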

Theorems & Definitions (40)

Theorem 1
Lemma 2
Proof
Lemma 3
Proof
Lemma 4: Optimism
Proof
Lemma 5: Cost of optimism
Proof
Lemma 6: Good event
...and 30 more
