Regret Bounds for Episodic Risk-Sensitive Linear Quadratic Regulator

Wenhao Xu, Xuefeng Gao, Xuedong He

TL;DR

This work studies online learning for the finite-horizon episodic risk-sensitive linear quadratic regulator (LEQR) with continuous state and action spaces. It presents two simple least-squares-based algorithms: a greedy method that achieves logarithmic regret in the number of episodes $N$ under a self-exploration identifiability condition, and a variant with injected exploration noise that achieves square-root regret when that condition fails. The analysis rests on a perturbation analysis of the LEQR Riccati equations and a careful bound on the loss in the risk-sensitive criterion incurred by running a suboptimal controller, yielding what the authors describe as the first regret bounds for episodic LEQR. The results are relevant to risk-sensitive online control in application areas such as finance and robotics. Directions left for future work include the infinite-horizon setting, regret lower bounds, and broader risk measures.

Abstract

The risk-sensitive linear quadratic regulator is one of the most fundamental problems in risk-sensitive optimal control. In this paper, we study online adaptive control of the risk-sensitive linear quadratic regulator in the finite-horizon episodic setting. We propose a simple least-squares greedy algorithm and show that it achieves $\widetilde{\mathcal{O}}(\log N)$ regret under a specific identifiability assumption, where $N$ is the total number of episodes. If the identifiability assumption is not satisfied, we propose incorporating exploration noise into the least-squares-based algorithm, resulting in an algorithm with $\widetilde{\mathcal{O}}(\sqrt{N})$ regret. To the best of our knowledge, this is the first set of regret bounds for the episodic risk-sensitive linear quadratic regulator. Our proof relies on a perturbation analysis of the less-standard Riccati equations for risk-sensitive linear quadratic control, and on a delicate analysis of the loss in the risk-sensitive performance criterion caused by applying a suboptimal controller during online learning.
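To make the two ingredients named in the abstract concrete, here is a minimal Python sketch of a least-squares estimate of the dynamics followed by a risk-sensitive Riccati recursion on the estimate. It is an illustration under stated assumptions, not the paper's algorithm: it assumes dynamics $x_{t+1} = A x_t + B u_t + w_t$ with $w_t \sim \mathcal{N}(0, I)$, the criterion $\frac{1}{\gamma}\log \mathbb{E}\exp\big(\frac{\gamma}{2}\sum_t (x_t^\top Q x_t + u_t^\top R u_t)\big)$, and the textbook recursion for that convention. The function names `least_squares_dynamics` and `leqr_gains` are hypothetical, and the paper's notation and scaling of $\gamma$ may differ.

```python
import numpy as np


def least_squares_dynamics(X, U, X_next):
    """Estimate (A, B) from transitions x_next = A x + B u + noise.

    X: (m, n) states, U: (m, d) controls, X_next: (m, n) next states.
    """
    Z = np.hstack([X, U])                      # regressors [x, u], shape (m, n + d)
    Theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # Theta = [A^T; B^T]
    n = X.shape[1]
    return Theta[:n].T, Theta[n:].T


def leqr_gains(A, B, Q, R, Q_T, gamma, T):
    """Backward risk-sensitive Riccati recursion (assumes unit noise covariance).

    Returns the feedback gains K_0..K_{T-1} (u_t = K_t x_t) and P_0.
    With gamma = 0 this reduces to the standard finite-horizon LQR recursion.
    """
    n = A.shape[0]
    P = np.array(Q_T, dtype=float)
    gains = [None] * T
    for t in reversed(range(T)):
        M = np.eye(n) - gamma * P
        # The recursion is only well defined while I - gamma * P is positive definite.
        if np.min(np.linalg.eigvalsh((M + M.T) / 2)) <= 0:
            raise ValueError(f"I - gamma * P is not positive definite at step {t}")
        P_tilde = P + gamma * P @ np.linalg.solve(M, P)   # risk-adjusted cost-to-go
        G = R + B.T @ P_tilde @ B
        K = -np.linalg.solve(G, B.T @ P_tilde @ A)
        gains[t] = K
        P = Q + A.T @ P_tilde @ A + A.T @ P_tilde @ B @ K
        P = (P + P.T) / 2                                  # guard against round-off asymmetry
    return gains, P


# Illustrative usage on synthetic data (all numbers below are made up for the demo).
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
X = rng.normal(size=(200, 2))
U = rng.normal(size=(200, 1))
X_next = X @ A_true.T + U @ B_true.T + 0.1 * rng.normal(size=(200, 2))

A_hat, B_hat = least_squares_dynamics(X, U, X_next)
gains, P0 = leqr_gains(A_hat, B_hat, np.eye(2), np.eye(1), np.eye(2), gamma=0.1, T=5)
```

Plugging the estimates directly into the recursion is the "greedy" (certainty-equivalent) step in this sketch; the exploration-noise variant described in the abstract would additionally perturb the control inputs, which is not shown here.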

Paper Structure

This paper contains 32 sections, 29 theorems, 194 equations, 3 figures, 2 algorithms.

Key Result

Theorem 1

Suppose Assumption ASSU1 holds and assume the optimal controller for the initial estimate $\theta^1$ also satisfies (PSDKK). Fix $\delta\in (0,\frac{3}{\pi^2})$. Then we can choose $m_1=\mathcal{C}_0(-\log\delta)$ for some positive constant $\mathcal{C}_0$ such that with probability at least $1-\dots$ […], where $\mathcal{C}$ is a constant independent of $N$ and $(\psi_t)$ is a sequence recursively defined …

Figures (3)

  • Figure 1: Simulation results in System 1
  • Figure 2: Simulation results in System 2
  • Figure 3: Simulation results in System 3

Theorems & Definitions (49)

  • Theorem 1
  • Proposition 1: Informal
  • Theorem 2
  • Proposition 2: Informal
  • Proposition 3
  • Definition 1: Definition 2.7 of wainwright2019high
  • Lemma 1: Bernstein Inequality, Proposition 2.9 of wainwright2019high
  • Lemma 2: Lemma 5.1 of Alessandro2018AST
  • Lemma 3: Lemma 2.7.7 of vershynin2018high
  • Lemma 4
  • ...and 39 more