Low-Rank MDPs with Continuous Action Spaces

Andrew Bennett; Nathan Kallus; Miruna Oprescu

TL;DR

The paper extends PAC reinforcement learning bounds from finite to continuous action spaces within the low-rank MDP framework. It develops two main strategies, exploiting action-space smoothness of error terms and policy smoothing, to adapt PAC guarantees without altering the FLAMBE algorithm, under Hölder-type smoothness conditions on transitions and rewards. Case studies on FLAMBE (and on RAFFLE in the appendix) derive polynomial sample-complexity bounds that depend on the action dimension, the rank, and the smoothness orders, with explicit expressions such as $\tau = m/\alpha_E$ and $\sigma = m/\min(\alpha_T, \alpha_R)$. The work provides a practical blueprint for enabling PAC RL with continuous actions in low-rank MDPs and discusses concrete implementation options for the elliptical planner.
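As a small illustration of how the two exponents quoted above behave, the following sketch evaluates them for given inputs. The function name `smoothness_exponents` is hypothetical, and reading `m` as the action dimension and the `alpha_*` arguments as smoothness orders of the error terms, transitions, and rewards is an assumption based on the summary above, not a definition taken from the paper.

```python
def smoothness_exponents(m, alpha_E, alpha_T, alpha_R):
    """Evaluate tau = m / alpha_E and sigma = m / min(alpha_T, alpha_R).

    Assumed meanings (not confirmed by the source):
      m        -- dimension of the continuous action space
      alpha_E  -- smoothness order of the error terms
      alpha_T  -- smoothness order of the transition model
      alpha_R  -- smoothness order of the reward function
    """
    if m <= 0:
        raise ValueError("m must be positive")
    if min(alpha_E, alpha_T, alpha_R) <= 0:
        raise ValueError("smoothness orders must be positive")
    tau = m / alpha_E
    sigma = m / min(alpha_T, alpha_R)
    return tau, sigma


# Example: a 2-dimensional action space with a rough transition model.
tau, sigma = smoothness_exponents(m=2, alpha_E=1.0, alpha_T=0.5, alpha_R=1.0)
```

Both quantities grow with the action dimension and shrink as the relevant functions become smoother; in `sigma`, the less smooth of the transition model and the reward determines the value.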

Abstract

Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for probably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.
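For orientation, the block below states the textbook Hölder condition of order $\alpha \in (0, 1]$ and a heuristic covering count. It is a sketch, not the paper's own definition of $\alpha$-smoothness, which may be more general. The constant $L$, the grid spacing $\epsilon$, the tolerance $\delta$, and the choice of action space $[0,1]^m$ are illustrative assumptions, and dimension-dependent constants are ignored.

```latex
% Standard Hölder condition of order \alpha \in (0, 1] with constant L > 0
% (illustrative; not necessarily the definition used in the paper).
\[
  |f(a) - f(a')| \le L \, \|a - a'\|^{\alpha}
  \quad \text{for all } a, a' \in \mathcal{A} = [0,1]^m .
\]
% Heuristic: a grid of spacing \epsilon has on the order of \epsilon^{-m} points,
% and f changes by at most about L \epsilon^{\alpha} between an action and its
% nearest grid point. Matching that change to a tolerance \delta gives
\[
  \epsilon \approx (\delta / L)^{1/\alpha}
  \quad \Longrightarrow \quad
  \#\{\text{grid points}\} \approx (L / \delta)^{m/\alpha} .
\]
```

This is one way to see why a bound might depend on the order of smoothness through a ratio of the form $m/\alpha$: rougher functions (smaller $\alpha$) or higher-dimensional actions (larger $m$) call for a finer effective discretization of the action space.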

Paper Structure (40 sections, 11 theorems, 73 equations)

INTRODUCTION
Related Work
Overview of Paper
Notation
LOW-RANK MDP SETTING
LIMITATIONS OF EXISTING LOW-RANK MDP RESULTS
ERROR BOUNDS WITH CONTINUOUS ACTIONS
Utilizing Smoothness of Error Functions
Using Smoothed Policies
CASE STUDY: FLAMBE
FLAMBE Algorithm
Smoothness Assumption for Extension
Bound for Restricted Policies
Bound for Unrestricted Policies
Discussion of Implementation
...and 25 more sections

Key Result

Lemma 1

For any given $f : {\mathcal{S}} \times \mathcal{A} \to \mathbb{R}^+$, any distribution $\rho$ over states, and any policy $\pi$, we have

Theorems & Definitions (23)

Definition 1: Low-rank MDP
Lemma 1
Definition 2: $\alpha$-smooth functions
Definition 3: $\alpha$-smoothness norm
Theorem 2: Uniform bound on $\alpha$-smooth functions
Lemma 3
Definition 4
Theorem 4
Lemma 5
Theorem 6
...and 13 more
