Guaranteeing Control Requirements via Reward Shaping in Reinforcement Learning

Francesco De Lellis, Marco Coraggio, Giovanni Russo, Mirco Musolesi, Mario di Bernardo

TL;DR

The paper addresses how to guarantee control performance when a controller is learned purely from data. It introduces a constructive reward-shaping framework, consisting of a bounded base reward plus a correction term $r^c$, together with a threshold $\sigma$ on the discounted return, to certify that high-return trajectories are acceptable with respect to a prescribed settling time $k_\mathrm{s}$ and permanence $k_\mathrm{p}$ within a goal region $\mathcal{G}$. By deriving compatibility conditions among the shaping parameters and providing an algorithm to compute the correction terms $r^c$, the approach enables model-free synthesis and validation of acceptable policies, even under uncertain dynamics. The method is validated on two OpenAI Gym tasks, Inverted Pendulum and Lunar Lander, using both Q-learning and Double DQN; the learned policies meet the specified control requirements, and the experiments highlight practical considerations such as reward sparsity and exploration effects. The framework offers a principled route to deploying RL-based controllers with verifiable performance guarantees in real-world control settings.

Abstract

In addressing control problems such as regulation and tracking through reinforcement learning, it is often required to guarantee that the acquired policy meets essential performance and stability criteria, such as a desired settling time and steady-state error, prior to deployment. Motivated by this necessity, we present a set of results and a systematic reward shaping procedure that (i) ensures the optimal policy generates trajectories that align with specified control requirements and (ii) makes it possible to assess whether any given policy satisfies them. We validate our approach through comprehensive numerical experiments conducted in two representative environments from OpenAI Gym: the Inverted Pendulum swing-up problem and the Lunar Lander. Utilizing both tabular and deep reinforcement learning methods, our experiments consistently affirm the efficacy of our proposed framework, highlighting its effectiveness in ensuring policy adherence to the prescribed control requirements.

Paper Structure (30 sections, 10 theorems, 41 equations, 7 figures, 1 algorithm)


Key Result

Proposition 3.5

Let the assumptions on the reward structure and on the reward inequalities hold. Then, high-return state-space sequences are acceptable.
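
The statement above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a sequence is "high-return" when its discounted return strictly exceeds $\sigma$, and that it is "acceptable" when it enters the goal region $\mathcal{G}$ at some step no later than $k_\mathrm{s}$ and then stays inside for $k_\mathrm{p}$ consecutive steps; the paper's exact definitions may differ. The function names `discounted_return`, `is_high_return`, and `is_acceptable`, and the `in_goal` predicate, are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * r_k for a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


def is_high_return(rewards: Sequence[float], gamma: float, sigma: float) -> bool:
    """Assumed definition: the discounted return strictly exceeds sigma."""
    return discounted_return(rewards, gamma) > sigma


def is_acceptable(
    states: Sequence[State],
    in_goal: Callable[[State], bool],
    k_s: int,
    k_p: int,
) -> bool:
    """Assumed definition: there is an entry step k_g <= k_s such that the
    k_p consecutive states starting at k_g all lie in the goal region."""
    last_entry = min(k_s, len(states) - 1)
    for k_g in range(last_entry + 1):
        window = states[k_g:k_g + k_p]
        # Require a full window so that truncated sequences are not accepted.
        if len(window) == k_p and all(in_goal(x) for x in window):
            return True
    return False


# Toy usage: scalar states, goal region |x| <= 0.1 (illustrative values only).
states = [1.0, 0.5, 0.05, 0.02, 0.0, 0.01]
near_origin = lambda x: abs(x) <= 0.1
ok_slow = is_acceptable(states, near_origin, k_s=3, k_p=3)
ok_fast = is_acceptable(states, near_origin, k_s=1, k_p=3)
```

In this toy run the sequence enters the goal region at step 2, so it is acceptable for `k_s=3` but not for `k_s=1`. Under the proposition, a check like `is_high_return` on a rollout's rewards would be enough to conclude acceptability without inspecting the states, provided the reward satisfies the stated assumptions.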

Figures (7)

  • Figure 1: (a): A state-space sequence $\xi$, a trajectory $\phi^\pi(\tilde{x}_0)$, and a goal region $\mathcal{G}$ (see the problem statement section); while a state-space sequence is simply a sequence of points in the state space $\mathcal{X}$, a trajectory is generated by applying a policy to the system dynamics. (b): Terms of the reward structure in the reward-structure assumption.
  • Figure 2: Schematic representation of the main assumptions and results in the main results section. Green blocks denote assumptions, blue blocks indicate analytical findings, yellow blocks denote algorithms, and purple blocks refer to the problems being studied. Dashed arrows denote optional steps in the control design. "SSS" means "state-space sequence"; the symbols in the figure are defined in the main results section.
  • Figure 3: Sketches of the environments used in the numerical validation (numerical results section): (a) Inverted Pendulum and (b) Lunar Lander. Both the pendulum and the lander are depicted in their initial states.
  • Figure 4: Average (blue line) plus/minus standard deviation (shaded area) of $\left\lVert x_k - x^\mathrm{ref} \right\rVert$ (top panel), angular position $x_{k,1}$ (middle panel), and angular velocity $x_{k,2}$ (bottom panel), obtained by $S$ policies trained with Q-learning in the pendulum environment. The green solid line (top panel) indicates the goal region $\mathcal{G}$; the green dashed lines (middle and bottom panels) indicate neighborhoods of width $2\theta$ centered at $x_{k,1} = x^\mathrm{ref}_1 = 0$ (middle panel) and at $x_{k,2} = x^\mathrm{ref}_2 = 0$ (bottom panel). The red line indicates the time instant when the (averaged) trajectory enters the goal region.
  • Figure 5: Average (green line) plus/minus standard deviation (shaded area) of the discounted returns per episode obtained in $S$ training sessions with Q-learning in the pendulum environment. The red line indicates the threshold value $\sigma$ (cf. the section on assessing acceptable state-space sequences). The returns are averaged backwards across episodes using a moving window of 50 samples.
  • ...and 2 more figures

Theorems & Definitions (31)

  • Definition 2.1: State-space sequences
  • Definition 2.2: First exit instant
  • Definition 2.3: Acceptable state-space sequences, trajectories and policies
  • Remark 3.2: Generality of the reward-structure assumption
  • Definition 3.3: High-return state-space sequences, trajectories and policies
  • Proposition 3.5
  • proof
  • Corollary 3.7
  • proof
  • Remark 3.8: Selection of $k_\mathrm{p}$
  • ...and 21 more