Data-Based Efficient Off-Policy Stabilizing Optimal Control Algorithms for Discrete-Time Linear Systems via Damping Coefficients
Dongdong Li, Jiuxiang Dong
TL;DR
This work addresses stabilizing optimal control for discrete-time (DT) linear systems with unknown dynamics by introducing two model-free, off-policy algorithms built on a damping-coefficient homotopy. The first algorithm extends policy iteration to a data-driven setting; the second uses off-policy Q-learning to estimate both the stabilizing gains and the solutions of the associated ARE-like equations. Neither algorithm applies the current gain estimate to the plant. Both come with explicit damping-coefficient update rules and persistent-excitation conditions on the data that guarantee convergence to the optimal gain and the algebraic Riccati equation (ARE) solution, and simulations on an unstable DT system show rapid convergence. The approach reduces reliance on an accurate system model and enables efficient, data-driven stabilization and near-optimal control in practice, with possible extensions to broader classes of DT systems.
Abstract
Policy iteration is one of the classical frameworks of reinforcement learning, but it requires a known initial stabilizing control, and finding such a control typically depends on a known system model. To relax this requirement and achieve model-free optimal control, this paper designs two reinforcement learning algorithms, based on policy iteration and variable damping coefficients, for unknown discrete-time linear systems. First, a stable artificial system is constructed and gradually driven toward the original system by varying the damping coefficients, so that an initial stabilizing control is obtained in a finite number of iteration steps. Then, an off-policy iteration algorithm and an off-policy $\mathcal{Q}$-learning algorithm are designed to select appropriate damping coefficients and to realize a data-driven implementation. In both algorithms, the current estimates of the optimal control gain are not applied to the system to re-collect data. Moreover, both retain the fast convergence of traditional policy iteration. Finally, the proposed algorithms are validated by simulation.
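The damping-coefficient idea in the abstract can be illustrated with a small model-based sketch. It is not the paper's method: the paper's algorithms are data-driven and off-policy, and the abstract does not state their damping update rule. The sketch assumes that $A$ and $B$ are known, that the artificial system is $(aA, aB)$ with a scalar damping coefficient $a \in (0, 1]$, and that $a$ is raised by a simple spectral-radius rule. The names `homotopy_lqr` and `margin`, and the example system, are hypothetical.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def evaluate_policy(Ac, M):
    """Solve P = Ac^T P Ac + M by vectorization (Ac must be Schur stable)."""
    n = Ac.shape[0]
    lhs = np.eye(n * n) - np.kron(Ac.T, Ac.T)
    P = np.linalg.solve(lhs, M.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # remove round-off asymmetry

def policy_iteration(A, B, Q, R, K, tol=1e-10, max_iter=100):
    """Classical policy iteration, started from a gain K that stabilizes (A, B)."""
    for _ in range(max_iter):
        P = evaluate_policy(A - B @ K, Q + K.T @ R @ K)        # policy evaluation
        K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # policy improvement
        if np.linalg.norm(K_new - K) < tol:
            return K_new, P
        K = K_new
    raise RuntimeError("policy iteration did not converge")

def homotopy_lqr(A, B, Q, R, margin=0.95, max_stages=50):
    """Assumed damping schedule: start from a stable damped system, end at a = 1."""
    n, m = B.shape
    K = np.zeros((m, n))
    rho0 = spectral_radius(A)
    # Pick a so that a*A is Schur stable; K = 0 then stabilizes (a*A, a*B).
    a = 1.0 if rho0 < margin else margin / rho0
    for _ in range(max_stages):
        K, P = policy_iteration(a * A, a * B, Q, R, K)
        if a == 1.0:
            return K, P  # gain and Riccati solution for the original system
        rho = spectral_radius(A - B @ K)
        # Raise a while keeping a_next * (A - B K) Schur stable.
        a_next = 1.0 if rho < 1.0 else margin / rho
        if a_next <= a:
            raise RuntimeError("damping coefficient stalled")
        a = a_next
    raise RuntimeError("homotopy did not reach a = 1")

def riccati_recursion(A, B, Q, R, iters=2000):
    """Reference solution by iterating the discrete-time Riccati difference equation."""
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ G)
    return G, P

# Hypothetical unstable example system (not taken from the paper).
A = np.array([[1.1, 0.5], [0.0, 1.2]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

K_hom, P_hom = homotopy_lqr(A, B, Q, R)
K_ref, P_ref = riccati_recursion(A, B, Q, R)
print("homotopy gain :", K_hom)
print("reference gain:", K_ref)
```

In this sketch no stabilizing gain is supplied by the user: the zero gain is admissible for the damped system, and each stage passes its gain on as a stabilizing start for the next, less damped system. The data-driven versions described in the abstract would replace the model-based evaluation and improvement steps with estimates computed from measured trajectories.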
