Logarithmic Regret for Nonlinear Control

James Wang, Bruce D. Lee, Ingvar Ziemann, Nikolai Matni

TL;DR

The paper studies learning to control unknown nonlinear dynamical systems in settings where mistakes can be costly, and establishes conditions under which regret grows polylogarithmically in the number of episodes $N$. It introduces a two-phase certainty-equivalence algorithm, LogRegret, that first constructs a confidence set for the unknown dynamics and then optimizes policies online within that set; a persistently exciting optimal policy ensures that the prediction-error objective is strongly convex. When identifiability fails, an Explore-Then-Commit approach yields a slower $O(\sqrt{N})$ regret bound. The theory is complemented by numerical experiments on a toy system and a cartpole example, which show fast convergence of cost and sublinear regret in practice. Overall, the work provides the first regret bounds for nonlinear dynamical systems with nonlinear parameter dependence and offers a principled path to fast, safe learning in continuous control, with potential extensions to online experiment design and single-trajectory settings.
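The Explore-Then-Commit idea with a certainty-equivalent policy can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the scalar system $x_{t+1} = \tanh(\theta x_t) + u_t + w_t$, the cost $\sum_t x_{t+1}^2$, the grid-search estimator, and all function names and default values. It is not the paper's LogRegret algorithm, only a toy instance of estimating a parameter that enters the dynamics nonlinearly and then committing to the policy that would be optimal if the estimate were correct.

```python
import math
import random

def step(theta, x, u, w):
    """One step of the hypothetical toy system x' = tanh(theta * x) + u + w."""
    return math.tanh(theta * x) + u + w

def rollout(theta_true, policy, horizon, rng, noise_std):
    """Run one episode from x = 0; return transitions and the cost sum of x_{t+1}^2."""
    x, cost, data = 0.0, 0.0, []
    for _ in range(horizon):
        u = policy(x)
        x_next = step(theta_true, x, u, rng.gauss(0.0, noise_std))
        data.append((x, u, x_next))
        cost += x_next ** 2
        x = x_next
    return data, cost

def fit_theta(data, grid):
    """Prediction-error minimization: least-squares fit of theta over a grid."""
    def loss(theta):
        return sum((xn - u - math.tanh(theta * x)) ** 2 for x, u, xn in data)
    return min(grid, key=loss)

def explore_then_commit(theta_true=0.8, n_episodes=200, n_explore=20,
                        horizon=10, noise_std=0.1, seed=0):
    """Explore with random inputs, estimate theta, then commit to the
    certainty-equivalent policy u = -tanh(theta_hat * x)."""
    rng = random.Random(seed)
    grid = [i / 100 for i in range(0, 201)]  # candidate values in [0, 2]
    data, costs = [], []

    # Phase 1: exploration episodes driven by random inputs.
    for _ in range(n_explore):
        d, c = rollout(theta_true, lambda x: rng.gauss(0.0, 1.0),
                       horizon, rng, noise_std)
        data += d
        costs.append(c)

    theta_hat = fit_theta(data, grid)

    # Phase 2: commit to the policy that cancels the estimated dynamics.
    for _ in range(n_episodes - n_explore):
        _, c = rollout(theta_true, lambda x: -math.tanh(theta_hat * x),
                       horizon, rng, noise_std)
        costs.append(c)
    return theta_hat, costs
```

In this toy setting the fully informed policy leaves only the noise, so its expected per-episode cost is `horizon * noise_std ** 2`. An empirical regret estimate is therefore `sum(costs) - len(costs) * horizon * noise_std ** 2`, and it is dominated by the exploration episodes.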

Abstract

We address the problem of learning to control an unknown nonlinear dynamical system through sequential interactions. Motivated by high-stakes applications in which mistakes can be catastrophic, such as robotics and healthcare, we study situations where it is possible for fast sequential learning to occur. Fast sequential learning is characterized by the ability of the learning agent to incur logarithmic regret relative to a fully-informed baseline. We demonstrate that fast sequential learning is achievable in a diverse class of continuous control problems where the system dynamics depend smoothly on unknown parameters, provided the optimal control policy is persistently exciting. Additionally, we derive a regret bound which grows with the square root of the number of interactions for cases where the optimal policy is not persistently exciting. Our results provide the first regret bounds for controlling nonlinear dynamical systems depending nonlinearly on unknown parameters. We validate the trends our theory predicts in simulation on a simple dynamical system.
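To make the two rates concrete, regret can be written in a generic form. The notation below is assumed for illustration and may differ from the paper's: $\pi_n$ is the policy played in episode $n$, $\pi^\star$ is the fully informed baseline, and $J(\pi)$ is the expected episode cost of policy $\pi$ under the true dynamics.

$$
R_N \;=\; \sum_{n=1}^{N} \bigl( J(\pi_n) - J(\pi^\star) \bigr)
$$

Under this reading, fast sequential learning corresponds to $R_N$ growing polylogarithmically in $N$ when the optimal policy is persistently exciting, while the fallback guarantee is $R_N = O(\sqrt{N})$.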

Paper Structure (25 sections, 12 theorems, 73 equations, 3 figures, 2 algorithms)

Key Result

Theorem 1

If the optimal policy solving a given continuous control task is identifiable from an experiment running the optimal policy, polylogarithmic regret is attained by our algorithm, LogRegret.

Figures (3)

  • Figure 1: Average regret incurred by the LogRegret algorithm on the toy dynamical system, versus iterations and $\log(\text{iterations})$, respectively. The mean over 30 runs is shown, with the standard error shaded.
  • Figure 2: The first plot shows the average cost incurred by the LogRegret algorithm on the cartpole system, versus iterations. The mean over 30 runs is shown in blue, with the standard error shaded. The cost of a "best-in-class" controller is shown with the dashed black line. The second and third plots show average regret versus iterations and the logarithm of iterations, respectively.
  • Figure 3: Explore-Then-Commit
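
The captions above report a mean over 30 runs with the standard error shaded. A minimal sketch of that summary statistic follows, assuming each run is stored as an equal-length list of per-iteration values; the function name and data layout are hypothetical and not taken from the paper's code.

```python
import math
import statistics

def mean_and_standard_error(runs):
    """Per-iteration mean and standard error across independent runs.

    `runs` is a list of equal-length sequences, one per run (at least two runs).
    """
    n_runs = len(runs)
    means, errors = [], []
    # zip(*runs) groups the values of all runs at the same iteration.
    for values in zip(*runs):
        means.append(statistics.mean(values))
        # Standard error = sample standard deviation / sqrt(number of runs).
        errors.append(statistics.stdev(values) / math.sqrt(n_runs))
    return means, errors
```

A shaded band would then span `mean - error` to `mean + error` at each iteration.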

Theorems & Definitions (13)

  • Theorem 1: Informal version of the main result
  • Definition 2
  • Theorem 3
  • Corollary 4
  • Theorem 5
  • Lemma 1: Lemma A.1 of lee2024active
  • Lemma 2: Modified from Lemma 3.1 of lee2024active
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • ...and 3 more