A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms

Weiqin Chen; Mark S. Squillante; Chai Wah Wu; Santiago Paternain

TL;DR

This work targets the sample inefficiency of reinforcement learning by introducing control-based reinforcement learning (CBRL), which directly learns the unknown variables of an underlying control problem and derives the optimal policy from them. It builds a general theory around a contraction-based CBRL operator and a Q-learning analogue, together with a control-policy-variable gradient theorem that ties policy performance to the learned variables, and uses the linear-quadratic regulator (LQR) as a representative instantiation. The authors prove contraction and convergence properties, establish asymptotic optimality under approximate policy families, and derive a gradient ascent method for updating the learned variables. Empirically, CBRL with LQR (and piecewise-LQR for nonlinear tasks) achieves better solution quality, lower sample complexity, and shorter running times on Cart Pole, Lunar Lander, Mountain Car, and Pendulum than the Linear policy, PPO, DQN, and DDPG baselines.
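To make the LQR instantiation concrete, the following is a minimal sketch of how a linear feedback policy can be derived once the variables of a linear-quadratic control problem are known. It uses the standard discrete-time Riccati recursion, not the paper's CBRL operator. The function name `lqr_gain` and the double-integrator matrices `A`, `B`, `Q`, `R` are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati recursion; return (K, P) with u = -K x."""
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        # K = (R + B'PB)^{-1} B'PA
        K = np.linalg.solve(R + BtP @ B, BtP @ A)
        # P = Q + A'PA - A'PB K, written compactly as Q + A'P(A - BK)
        P = Q + A.T @ P @ (A - B @ K)
    return K, P

# Hypothetical variables: a double integrator with time step 0.1.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

K, P = lqr_gain(A, B, Q, R)

# The derived policy u = -K x should stabilize the closed loop x' = (A - BK) x.
spectral_radius = max(abs(np.linalg.eigvals(A - B @ K)))
print("gain:", K, "closed-loop spectral radius:", spectral_radius)
```

In a CBRL-style setting, the entries of matrices such as `A` and `B` would play the role of the variables being learned, and the policy would be recomputed from them after each update.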

Abstract

We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish various theoretical properties of our approach, such as convergence and optimality of our analog of the Bellman operator and Q-learning, a new control-policy-variable gradient theorem, and a specific gradient ascent algorithm based on this theorem within the context of a specific control-theoretic framework. We empirically evaluate the performance of our control-theoretic approach on several classical reinforcement learning tasks, demonstrating significant improvements in solution quality, sample complexity, and running time of our approach over state-of-the-art methods.
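The idea of improving policy performance by ascending a gradient taken with respect to the control-problem variables can be illustrated on a scalar example. The sketch below is not the paper's algorithm: it replaces the analytic control-policy-variable gradient with central finite differences and adds a simple step-halving rule. The true dynamics `A_TRUE` and `B_TRUE`, the cost weights, and all function names are hypothetical.

```python
A_TRUE, B_TRUE = 1.2, 1.0  # hypothetical true scalar dynamics x' = a x + b u

def riccati_gain(a, b, q=1.0, r=1.0, iters=200):
    """Scalar Riccati recursion; returns the gain k of the policy u = -k x."""
    p = q
    for _ in range(iters):
        p = q + a * a * p * r / (r + b * b * p)
    return a * b * p / (r + b * b * p)

def cost(theta, horizon=50, q=1.0, r=1.0):
    """Quadratic cost of the policy induced by theta = [a_hat, b_hat] on the true system."""
    k = riccati_gain(theta[0], theta[1])
    x, total = 1.0, 0.0
    for _ in range(horizon):
        u = -k * x
        total += q * x * x + r * u * u
        x = A_TRUE * x + B_TRUE * u
    return total

def fd_grad(theta, eps=1e-5):
    """Central finite-difference gradient of the cost with respect to theta."""
    grad = []
    for i in range(len(theta)):
        hi, lo = list(theta), list(theta)
        hi[i] += eps
        lo[i] -= eps
        grad.append((cost(hi) - cost(lo)) / (2 * eps))
    return grad

theta = [0.5, 1.0]  # deliberately wrong initial variables
initial_cost = cost(theta)
step = 0.05
for _ in range(200):
    g = fd_grad(theta)
    # Ascent on performance (-cost) is descent on cost.
    candidate = [t - step * gi for t, gi in zip(theta, g)]
    if cost(candidate) < cost(theta):
        theta = candidate
    else:
        step *= 0.5  # reject the step and try a smaller one
final_cost = cost(theta)
print("initial cost:", initial_cost, "final cost:", final_cost, "theta:", theta)
```

The point of the sketch is that the variables are tuned for the performance of the policy they induce rather than for how well they predict the dynamics, so the learned values need not match `A_TRUE` and `B_TRUE` exactly.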

Paper Structure (37 sections, 5 theorems, 30 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
CBRL Approach
Convergence and Optimality
Control-Policy-Variable Gradient Ascent
Experimental Results
Cart Pole under CBRL LQR
Lunar Lander under CBRL LQR
Mountain Car under CBRL Piecewise-LQR
Pendulum under CBRL Piecewise-LQR
Discussion
Conclusion
Control-Based Reinforcement Learning Approach
Proof of Theorem \ref{contraction}.
Proof of Theorem \ref{thm:QLearning}.
Proof of Theorem \ref{thm:PWL}.
...and 22 more sections

Key Result

theorem 1

For any $\gamma \in (0,1)$, the operator $\mathbf{T}$ in Eq. \ref{eqn:ContractionOp} is a contraction in the supremum norm. Supposing Assumption \ref{asm:richness} holds for the family of policy functions $\mathbb{F}$ and its variable set $\mathbb{V}$, the contraction operator $\mathbf{T}$ achieves the
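As a numerical illustration of what contraction in the supremum norm means, the sketch below checks the property for the standard Bellman optimality operator on a small random MDP. This is a stand-in for intuition only: the operator $\mathbf{T}$ of the theorem is defined over control-policy variables and is not reproduced here. The MDP sizes, the random transition tensor `P`, and the rewards `r` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random transition tensor P[a, s, s'] with rows normalized to sum to 1.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states))  # reward r[a, s]

def bellman(v):
    """Standard Bellman optimality operator: (Tv)(s) = max_a [r(a,s) + gamma * E v(s')]."""
    return np.max(r + gamma * (P @ v), axis=0)

def sup_norm(x):
    return np.max(np.abs(x))

# Contraction: ||Tv1 - Tv2||_inf <= gamma * ||v1 - v2||_inf for any v1, v2.
v1 = rng.normal(size=n_states)
v2 = rng.normal(size=n_states)
ratio = sup_norm(bellman(v1) - bellman(v2)) / sup_norm(v1 - v2)

# A contraction has a unique fixed point, reached by repeated application.
v = np.zeros(n_states)
for _ in range(500):
    v = bellman(v)
fixed_point_residual = sup_norm(bellman(v) - v)
print("contraction ratio:", ratio, "residual:", fixed_point_residual)
```

The ratio stays at or below `gamma` for any pair of value vectors, which is the property that guarantees convergence of fixed-point iteration.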

Figures (9)

Figure 1: Learning curves over five independent runs comparing our CBRL approach with the Linear policy, PPO, DQN (discrete actions), and DDPG (continuous actions), where the solid line shows the mean and the shaded area depicts the standard deviation for CartPole ($a$)-($c$), LunarLander ($d$)-($f$), MountainCar ($g$)-($i$), and Pendulum ($j$)-($l$).
Figure 2: The CartPole-v0 environment.
Figure 3: Learning curves of CartPole-v0 over five independent runs. The solid line shows the mean and the shaded area depicts the standard deviation. (a) and (b): Return vs. number of episodes and running time, respectively, for our CBRL approach (over the five independent runs and the four initializations in Table \ref{tab_init_para_cp}) in comparison with the Linear policy, PPO, and DQN (over the five independent runs). (c)-(f): Learning behavior of CBRL variables, initialized by Table \ref{tab_init_para_cp}.
Figure 4: The LunarLanderContinuous-v2 environment.
Figure 5: Learning curves of LunarLanderContinuous-v2 over five independent runs. The solid line shows the mean and the shaded area depicts the standard deviation. (a) and (b): Return vs. number of episodes and running time, respectively, for our CBRL approach (over the five independent runs and the four initializations in Table \ref{tab_init_para_ll}) in comparison with the Linear policy, PPO, and DDPG (over the five independent runs). (c)-(f): Learning behavior of CBRL variables, initialized by Table \ref{tab_init_para_ll}.
...and 4 more figures

Theorems & Definitions (10)

theorem 1
theorem 2
theorem 3
theorem 4
proof
proof
lemma 1
proof
proof
proof
