No-Regret Reinforcement Learning in Smooth MDPs

Davide Maran, Alberto Maria Metelli, Matteo Papini, Marcello Restelli

TL;DR

This work extends the theory of no-regret reinforcement learning to continuous state and action spaces by introducing the $\nu$-smooth MDP framework, which unifies several existing models (Lipschitz, linear, kernelized) under a single regularity notion. It develops two Legendre-based algorithms, Legendre-Eleanor for Weakly Smooth MDPs and Legendre-LSVI for Strongly Smooth MDPs; both use orthogonal Legendre features to reduce the problem to one with linear structure, and both are proven to be no-regret under the corresponding smoothness conditions. The regret bounds are explicit and adapt to the smoothness parameter $\nu$: Legendre-Eleanor achieves regret $R_K$ of order $K^{(3d/2+\nu+1)/(d+2(\nu+1))}$ (up to log factors) and Legendre-LSVI of order $H^{3/2}K^{(2d+\nu+1)/(d+2(\nu+1))}$, with Legendre-LSVI remaining computationally efficient in the Strongly Smooth case. The paper also shows that kernel methods (Matérn kernels) yield strong smoothness and places the proposed bounds in the broader literature, including Lipschitz MDPs, LQR, linear MDPs, and kernelized MDPs, offering a practical, theory-backed path toward no-regret RL in diverse continuous environments.
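To make the two rates above concrete, the following minimal sketch evaluates the exponent of $K$ in each bound for a few illustrative values of $d$ and $\nu$. The function name `regret_exponents` and the sample values are assumptions chosen for illustration; only the two exponent formulas come from the text. A bound is no-regret when its exponent is strictly below 1, which by simple algebra happens for the first rate when $\nu > d/2 - 1$ and for the second when $\nu > d - 1$.

```python
from fractions import Fraction


def regret_exponents(d, nu):
    """Exponents of K in the two regret rates quoted in the TL;DR.

    Returns (Legendre-Eleanor exponent, Legendre-LSVI exponent) as exact fractions.
    """
    d, nu = Fraction(d), Fraction(nu)
    denom = d + 2 * (nu + 1)
    eleanor = (Fraction(3, 2) * d + nu + 1) / denom
    lsvi = (2 * d + nu + 1) / denom
    return eleanor, lsvi


# Illustrative (d, nu) pairs; these values are not taken from the paper.
for d, nu in [(1, 1), (2, 1), (2, 3), (4, 2)]:
    e, l = regret_exponents(d, nu)
    print(
        f"d={d}, nu={nu}: "
        f"Eleanor K^{float(e):.3f} (sublinear: {e < 1}), "
        f"LSVI K^{float(l):.3f} (sublinear: {l < 1})"
    )
```

For instance, with $d=2$ and $\nu=1$ the first exponent is $5/6$ while the second is exactly 1, so only the first rate is sublinear at that smoothness level.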

Abstract

Obtaining no-regret guarantees for reinforcement learning (RL) in problems with continuous state and/or action spaces is still one of the major open challenges in the field. Recently, a variety of solutions have been proposed, but outside of very specific settings, the general problem remains unsolved. In this paper, we introduce a novel structural assumption on Markov decision processes (MDPs), namely $\nu$-smoothness, that generalizes most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To face this challenging scenario, we propose two algorithms for regret minimization in $\nu$-smooth MDPs. Both algorithms build upon the idea of constructing an MDP representation through an orthogonal feature map based on Legendre polynomials. The first algorithm, Legendre-Eleanor, achieves the no-regret property under weaker assumptions but is computationally inefficient, whereas the second one, Legendre-LSVI, runs in polynomial time, although for a smaller class of problems. After analyzing their regret properties, we compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
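As a rough illustration of an orthogonal feature map based on Legendre polynomials, the sketch below builds tensor-product features on $[-1,1]^d$ from one-dimensional Legendre polynomials. This is an assumption-laden sketch and not the paper's Definition 4.1: the names `legendre_values` and `legendre_features`, the choice of keeping every multi-index with per-coordinate degree at most `max_degree`, and the $L^2$ normalization are all illustrative choices.

```python
import itertools
import math


def legendre_values(x, max_degree):
    """Return [P_0(x), ..., P_max_degree(x)] using Bonnet's recurrence."""
    values = [1.0, x][: max_degree + 1]
    for n in range(1, max_degree):
        # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values


def legendre_features(z, max_degree):
    """Tensor-product Legendre features of a point z in [-1, 1]^d.

    Assumption: one feature per multi-index in {0, ..., max_degree}^d, so the
    output has (max_degree + 1) ** d entries.
    """
    per_dim = []
    for x in z:
        vals = legendre_values(x, max_degree)
        # Scale P_n by sqrt((2n + 1) / 2) so it has unit L2 norm on [-1, 1].
        per_dim.append([math.sqrt((2 * n + 1) / 2) * v for n, v in enumerate(vals)])
    features = []
    for idx in itertools.product(range(max_degree + 1), repeat=len(z)):
        features.append(math.prod(per_dim[i][n] for i, n in enumerate(idx)))
    return features


# A hypothetical 2-dimensional state-action point with degree-3 features.
phi = legendre_features([0.5, -0.25], max_degree=3)
print(len(phi))
```

Under these assumptions the number of features grows as $(N+1)^d$ with the maximum degree $N$, which is why the degree has to be balanced against the number of episodes.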

Paper Structure

This paper contains 30 sections, 20 theorems, 103 equations, 2 figures, and 1 table.

Key Result

Theorem 1

Let us consider a Weakly Smooth MDP $M$ with state-action space $[-1,1]^d$. Under the condition that $\nu>d/2-1$, Legendre-Eleanor initialized with $N=\lceil K^{\frac{1}{d+2(\nu+1)}} \rceil$, with probability at least $1-\delta$, suffers a regret of order at most $R_K \le \widetilde{\mathcal{O}}\big(K^{\frac{3d/2+\nu+1}{d+2(\nu+1)}}\big)$ as a function of $K$ (the rate quoted in the TL;DR), where the constant depends only on $d$ and $\nu$ and the $\widetilde{\mathcal{O}}$ hides logarithmic functions of $K$, $\delta$.
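The initialization in Theorem 1 can be computed directly from the number of episodes $K$. The sketch below is a minimal helper under stated assumptions: the name `legendre_degree` is hypothetical, and the small tolerance guarding against floating-point noise is an implementation choice, not something taken from the paper.

```python
import math


def legendre_degree(K, d, nu):
    """N = ceil(K^(1 / (d + 2 (nu + 1)))), as in the statement of Theorem 1."""
    if nu <= d / 2 - 1:
        raise ValueError("Theorem 1 requires nu > d/2 - 1")
    root = K ** (1.0 / (d + 2 * (nu + 1)))
    nearest = round(root)
    # Guard against floating-point noise when K is an exact power.
    if abs(root - nearest) < 1e-9:
        return nearest
    return math.ceil(root)


# Illustrative values only: K episodes, dimension d = 2, smoothness nu = 1.
for K in (100, 10_000, 1_000_000):
    print(K, legendre_degree(K, d=2, nu=1))
```

The degree grows slowly with $K$: for $d=2$ and $\nu=1$ the exponent is $1/6$, so going from $10^4$ to $10^6$ episodes only raises $N$ from 5 to 10.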

Figures (2)

  • Figure 1: Curve of the episodic return for the simulation in the experimental section, with 95% confidence intervals over five random seeds.
  • Figure 2: A schematic summarizing relations among families of continuous space RL problems. Our assumptions correspond to the red and orange sets.

Theorems & Definitions (35)

  • Definition 4.1: Legendre feature map
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • proof
  • Theorem 6
  • proof
  • Proposition 7
  • ...and 25 more