Technical Report on Reinforcement Learning Control on the Lucas-Nülle Inverted Pendulum
Maximilian Schenke, Shalbus Bukarov
TL;DR
This work demonstrates reinforcement learning (RL)-based control for swing-up and stabilization of an inverted pendulum on educational hardware, using a Deep Deterministic Policy Gradient (DDPG) framework with separate actor and critic networks. The agent maximizes the discounted return $g_k = r_{k+1} + \gamma r_{k+2} + \dots$, for which the optimal action is given by $a^*_k = \arg\max_a q(\mathbf{o}_k,a)$. The observation $\mathbf{o}_k$ is a designed feature vector intended to represent a Markov state, and it includes velocity estimates obtained via a PLL-based estimator. Experiments show convergence over 30 minutes of training and successful swing-up with safeguarding constraints, although exact tracking of the reference $x_{ref}$ remains challenging. Safeguarding is shown to be essential during training, and future work points to faster convergence and a reward redesign that reduces reliance on safety constraints. The approach provides an educational, model-free alternative to classical controllers, with practical implications for hands-on RL in control education.
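The two expressions above can be illustrated with a minimal Python sketch. It is not taken from the report's implementation: the function names, the toy action-value function, and the candidate action grid are assumptions made for illustration only. Note that DDPG does not enumerate actions; the actor network approximates the maximizing action over a continuous range, and the enumeration here only makes the $\arg\max$ explicit.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Compute g_k = r_{k+1} + gamma * r_{k+2} + ... for a finite reward list.

    `rewards` holds r_{k+1}, r_{k+2}, ... in temporal order.
    """
    g = 0.0
    # Accumulate backwards so that each step applies one discount factor.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def greedy_action(q_fn, observation, candidate_actions):
    """Return the candidate action with the highest action value q(o, a).

    `q_fn` is a hypothetical stand-in for a learned critic.
    """
    values = [q_fn(observation, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(values))]


# Toy critic (assumption): values peak at a = 0.3 regardless of the observation.
def toy_q(observation, action):
    return -(action - 0.3) ** 2


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
example_action = greedy_action(toy_q, observation=None,
                               candidate_actions=np.linspace(-1.0, 1.0, 21))
```

With three unit rewards and $\gamma = 0.5$, the return is $1 + 0.5 + 0.25 = 1.75$, and the greedy action on the toy critic is the grid point closest to 0.3.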
Abstract
The discipline of automatic control is making increased use of concepts that originate from the domain of machine learning. Among these, reinforcement learning (RL) plays a prominent role, as it is inherently designed for sequential decision making and can be applied to optimal control problems without the need for a plant system model. To advance the education of control engineers and operators in this field, this contribution targets an RL framework that can be applied to educational hardware provided by the Lucas-Nülle company. Specifically, the goal of inverted pendulum control is pursued by means of RL, including both swing-up and stabilization within a single holistic design approach. The actual learning is enabled by separating the corresponding computations from the real-time control computer and outsourcing them to separate hardware. This distributed architecture, however, necessitates communication between the involved components, which is realized via CAN bus. The experimental proof of concept is presented with an applied safeguarding algorithm that prevents the plant from being operated harmfully during the trial-and-error training phase.
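To make the idea of safeguarding concrete, the following Python sketch shows one simple way an action filter could sit between the agent and the plant. It is purely illustrative and is not the safeguarding algorithm used in this report: the function name, the normalized action range, the track limit `x_limit`, and the `margin` are all hypothetical values chosen for the example.

```python
def safeguard(action, cart_position, x_limit=0.4, margin=0.1, a_max=1.0):
    """Hypothetical action filter; all limits are illustrative assumptions.

    Returns (safe_action, intervened). The agent's action is first clipped
    to [-a_max, a_max]. If the cart is within `margin` of a track end and
    the action would push it further toward that end, the action is
    replaced by a full-magnitude action pointing back to the track center.
    """
    clipped = max(-a_max, min(a_max, action))

    near_right_end = cart_position > x_limit - margin
    near_left_end = cart_position < -(x_limit - margin)

    if near_right_end and clipped > 0.0:
        return -a_max, True
    if near_left_end and clipped < 0.0:
        return a_max, True
    return clipped, False


# Mid-track: the agent's action passes through unchanged.
free_action = safeguard(0.5, cart_position=0.0)
# Close to the right end while pushing right: the safeguard overrides.
overridden_action = safeguard(0.5, cart_position=0.35)
```

In a sketch like this, the `intervened` flag could be passed back to the learning process so that overridden steps can be treated differently during training; whether and how to do so is a design choice that this example leaves open.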
