Foundations of Safe Online Reinforcement Learning in the Linear Quadratic Regulator: Generalized Baselines

Benjamin Schiffer; Lucas Janson

TL;DR

This work studies safe online reinforcement learning for a one-dimensional Linear Quadratic Regulator with unknown dynamics. It introduces a general framework that analyzes nonlinear baseline controllers beyond safe linear policies, and proves two key regret guarantees: a $\tilde{O}_T(\sqrt{T})$ rate under large-support noise and a $\tilde{O}_T(T^{2/3})$ rate for subgaussian noise, relative to the best safe baseline. A novel nonlinear uncertainty bound shows that enforcing safety can enable productive exploration, effectively delivering a form of ``free exploration'' under sufficient noise. The results span general nonlinear baselines and provide constructive certainty-equivalence algorithms, with extensions discussed toward higher dimensions and joint state-control constraints in future work.

Abstract

Many practical applications of online reinforcement learning require the satisfaction of safety constraints while learning about the unknown environment. In this work, we establish theoretical foundations for reinforcement learning with safety constraints by studying the canonical problem of Linear Quadratic Regulator learning with unknown dynamics, but with the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Our primary contribution is a general framework for studying stronger baselines of nonlinear controllers that are better suited for constrained problems than linear controllers. Due to the difficulty of analyzing nonlinear controllers in a constrained problem, we focus on one-dimensional state and action spaces; however, we also discuss how we expect the high-level takeaways to generalize to higher dimensions. Using our framework, we show that for \emph{any} nonlinear baseline satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$-regret is possible when the noise distribution has sufficiently large support, and $\tilde{O}_T(T^{2/3})$-regret is possible for \emph{any} subgaussian noise distribution. In proving these results, we introduce a new uncertainty estimation bound for nonlinear controls which shows that enforcing safety in the presence of sufficient noise can provide ``free exploration'' that compensates for the added cost of uncertainty in safety-constrained control.
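To make the setting concrete, the following is a minimal illustrative sketch and not the paper's algorithm. It assumes one-dimensional dynamics of the form x_{t+1} = a x_t + b u_t + w_t with uniformly bounded noise, a safe region |x| <= x_max, and a truncated linear controller as one simple example of a nonlinear baseline. All function names, parameter values, and the choice of uniform noise are hypothetical. For simplicity the safety clip here uses the true (a, b); the learning problem in the abstract is harder precisely because these are unknown.

```python
import numpy as np

def truncated_linear_control(x, a, b, k, limit):
    """Linear feedback u = -k*x, clipped so that |a*x + b*u| <= limit.

    Hypothetical example of a nonlinear (piecewise-linear) baseline.
    """
    u = -k * x
    lo = (-limit - a * x) / b
    hi = (limit - a * x) / b
    if lo > hi:  # happens when b < 0
        lo, hi = hi, lo
    return float(np.clip(u, lo, hi))

def simulate_safe_baseline(a=0.9, b=1.0, k=0.1, x_max=1.0,
                           w_bar=0.4, T=2000, seed=0):
    """Roll out the truncated controller under noise w_t ~ Uniform[-w_bar, w_bar).

    Because |a*x + b*u| <= x_max - w_bar and |w_t| <= w_bar, every
    position satisfies |x_t| <= x_max (assumed setup, true a and b known).
    """
    rng = np.random.default_rng(seed)
    xs = np.zeros(T + 1)
    us = np.zeros(T)
    for t in range(T):
        us[t] = truncated_linear_control(xs[t], a, b, k, x_max - w_bar)
        xs[t + 1] = a * xs[t] + b * us[t] + rng.uniform(-w_bar, w_bar)
    return xs, us

def least_squares_estimate(xs, us, xs_next):
    """Ordinary least-squares fit of (a, b) in x_next ~ a*x + b*u."""
    Z = np.column_stack([xs, us])
    theta, *_ = np.linalg.lstsq(Z, xs_next, rcond=None)
    return float(theta[0]), float(theta[1])

if __name__ == "__main__":
    xs, us = simulate_safe_baseline()
    a_hat, b_hat = least_squares_estimate(xs[:-1], us, xs[1:])
    print("max |x_t|:", np.max(np.abs(xs)))
    print("steps where the safety clip was active:",
          int(np.sum(~np.isclose(us, -0.1 * xs[:-1]))))
    print("least-squares estimates (a, b):", a_hat, b_hat)
```

With a purely linear controller u = -k x, the regressors x and u are collinear, so (a, b) cannot be separately identified from the trajectory. In this sketch the safety clip breaks that collinearity whenever it is active, which gives one loose intuition for the ``free exploration'' effect described in the abstract. The resulting estimates can still be rough, since the clip is only active on a fraction of steps.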