Guaranteeing Control Requirements via Reward Shaping in Reinforcement Learning
Francesco De Lellis, Marco Coraggio, Giovanni Russo, Mirco Musolesi, Mario di Bernardo
TL;DR
The paper tackles the challenge of guaranteeing control performance when a controller is learned purely from data. It introduces a constructive reward-shaping framework, consisting of a bounded base reward plus a correction term together with a discount-based return threshold $\sigma$, to certify that high-return trajectories are acceptable with respect to a prescribed settling time $k_s$ and permanence $k_p$ within a goal region $\mathcal{G}$. By deriving compatibility conditions among the shaping parameters and providing an algorithm to compute the correction terms $r^c$, the approach enables model-free synthesis and validation of acceptable policies, even under uncertain dynamics. The method is validated on two OpenAI Gym tasks, Inverted Pendulum and Lunar Lander, using both Q-learning and Double DQN. The experiments show that the learned policies meet the specified control requirements and highlight practical considerations such as reward sparsity and exploration effects. The framework offers a principled route to deploying RL-based controllers with verifiable performance guarantees in real-world control settings.
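To make the two ingredients above concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes one plausible reading of acceptability: the trajectory enters $\mathcal{G}$ no later than step $k_s$ and then stays there for $k_p$ consecutive steps. The function names, the `in_goal` predicate, and the numerical values of `gamma` and `sigma` are hypothetical and chosen only for illustration; the paper's formal definitions and its conditions on $\sigma$ should be taken from the main text.

```python
from typing import Callable, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**k * r_k over the trajectory's rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


def is_acceptable(
    states: Sequence[float],
    in_goal: Callable[[float], bool],
    k_s: int,
    k_p: int,
) -> bool:
    """Assumed acceptability test (illustrative reading, not the paper's
    formal definition): some step k <= k_s starts a run of k_p consecutive
    states that all lie in the goal region."""
    last_start = min(k_s, len(states) - 1)
    for k in range(last_start + 1):
        window = states[k:k + k_p]
        if len(window) == k_p and all(in_goal(x) for x in window):
            return True
    return False


# Hypothetical scalar trajectory; the goal region is |x| < 0.5 (an assumption).
states = [5.0, 3.0, 0.1, 0.0, 0.05, 0.02]
rewards = [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
gamma, sigma = 0.9, 0.0  # illustrative values only


def near_origin(x: float) -> bool:
    return abs(x) < 0.5


acceptable = is_acceptable(states, near_origin, k_s=2, k_p=3)
above_threshold = discounted_return(rewards, gamma) > sigma
```

In the framework summarized above, the shaping parameters are chosen so that a return above $\sigma$ implies acceptability; the sketch only evaluates the two quantities side by side and does not establish that implication.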
Abstract
When control problems such as regulation and tracking are addressed through reinforcement learning, it is often necessary to guarantee, prior to deployment, that the acquired policy meets essential performance and stability criteria such as a desired settling time and steady-state error. Motivated by this necessity, we present a set of results and a systematic reward shaping procedure that (i) ensures the optimal policy generates trajectories that align with specified control requirements and (ii) makes it possible to assess whether any given policy satisfies them. We validate our approach through comprehensive numerical experiments conducted in two representative environments from OpenAI Gym: the Inverted Pendulum swing-up problem and the Lunar Lander. Across both tabular and deep reinforcement learning methods, the experiments consistently confirm that policies obtained with the proposed framework adhere to the prescribed control requirements.
