Adaptive Control of Positive Systems with Application to Learning SSP

Fethi Bencherki, Anders Rantzer

TL;DR

This paper develops an online, data-driven adaptive controller for infinite-horizon optimization of positive systems. It derives a data-driven algebraic equation from the model-free Bellman equation and learns the $Q$-factor from data using correlation matrices $\Sigma(t)$ and $\bar{\Sigma}(t)$, enabling policy extraction without explicit system identification and with robustness to unmodeled dynamics. Theoretical results provide perturbation and suboptimality bounds that relate learning errors to stability, through quantities $\rho$ and $\beta$ and Lyapunov function $V(x)=p^\top x$. Numerical experiments on a Stochastic Shortest Path problem show sublinear regret and superior performance of the adaptive controller compared to model-free SSP methods, highlighting practical potential for online routing and related applications.

Abstract

An adaptive controller is proposed and analyzed for the class of infinite-horizon optimal control problems in positive linear systems presented in (Ohlin et al., 2024b). This controller is derived from the solution of a "data-driven algebraic equation" constructed using the model-free Bellman equation from Q-learning. The equation is driven by data correlation matrices that do not scale with the number of data points, enabling efficient online implementation. Consequently, a sufficient condition guaranteeing stability and robustness to unmodeled dynamics is established. The derived results also provide a quantitative characterization of the interplay between excitation level and robustness to unmodeled dynamics. The class of optimal control problems considered here is equivalent to Stochastic Shortest Path (SSP) problems, allowing for a performance comparison between the proposed adaptive policy and model-free algorithms for learning the stochastic shortest path, as demonstrated in the numerical experiment.
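To illustrate why correlation matrices that "do not scale with the number of data points" suit online use, here is a minimal sketch. It assumes the matrix is a running sum of outer products of a stacked sample $z(t)$; the exact definitions of $\Sigma(t)$ and $\bar{\Sigma}(t)$ are not given in this summary, so `rank_one_update`, `samples`, and the choice of $z(t)$ are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def zeros(n: int) -> Matrix:
    """Return an n-by-n matrix of zeros."""
    return [[0.0] * n for _ in range(n)]


def rank_one_update(sigma: Matrix, z: List[float]) -> None:
    """In-place update sigma += z z^T (assumed form, not the paper's definition)."""
    for i, zi in enumerate(z):
        row = sigma[i]
        for j, zj in enumerate(z):
            row[j] += zi * zj


# Hypothetical stream of stacked samples z(t), e.g. state and input entries.
samples = [
    [1.0, 0.0, 0.5],
    [0.5, 1.0, 0.0],
    [0.0, 2.0, 1.0],
]

# The matrix stays 3-by-3 no matter how many samples arrive.
sigma = zeros(3)
for z in samples:
    rank_one_update(sigma, z)

# Batch computation over all stored samples, for comparison only.
batch = zeros(3)
for i in range(3):
    for j in range(3):
        batch[i][j] = sum(z[i] * z[j] for z in samples)
```

Each update costs a fixed number of operations determined by the sample dimension, and no past samples need to be stored. The batch version is shown only to confirm that the recursive update accumulates the same matrix.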

Paper Structure

This paper contains 23 sections, 5 theorems, 58 equations, 1 figure.

Key Result

Lemma 1

Iterating on $q$ in equation (value-itr-q2) is algebraically equivalent to iterating on $p$ in equation (value_itr_p).

Figures (1)

  • Figure 1: Each plot represents the average over 100 repeated runs, with the shaded area indicating the 95% confidence interval.

Theorems & Definitions (13)

  • Remark 1
  • Definition 1
  • Remark 2
  • Remark 3
  • Lemma 1
  • Remark 4
  • Lemma 2
  • Theorem 1
  • Remark 5
  • Theorem 2
  • ...and 3 more