The Feasibility Theory of Constrained Reinforcement Learning: A Tutorial Study

Yujie Yang, Zhilong Zheng, Masayoshi Tomizuka, Changliu Liu, Shengbo Eben Li

TL;DR

The paper tackles safety-constrained control by addressing the infeasibility that can arise for non-optimal RL policies and for MPC, proposing a unified feasibility theory that separates policy solving in a virtual-time domain from policy execution in a real-time domain. It introduces initial feasibility, endless feasibility, and the maximum endlessly feasible region (EFR), and presents containment and equivalence results that relate policy-specific and problem-wide feasibility. The core contribution is a feasibility-function framework that splits constraints into two families: control invariant set (Type I) and constraint aggregation (Type II), instantiated via CBF, SI, CVF, HJ reachability, and CDF, with practical design rules to maximize feasible regions. The theory is validated through visualizations of feasible regions in emergency braking and unicycle obstacle avoidance tasks, highlighting how RL can progressively expand the feasible region and approach the maximum EFR, thereby improving safety during learning and deployment. Overall, the work provides a principled toolkit for designing virtual-time constraints that ensure long-term safety across both MPC and RL settings and offers guidance for achieving the largest possible safe operating region.

Abstract

Satisfying safety constraints is a priority concern when solving optimal control problems (OCPs). Because of the infeasibility phenomenon, in which no constraint-satisfying solution can be found, it is necessary to identify a feasible region before implementing a policy. Existing feasibility theories built for model predictive control (MPC) consider only the feasibility of the optimal policy. However, reinforcement learning (RL), another important control method, solves for the optimal policy iteratively, which produces a series of non-optimal intermediate policies. Feasibility analysis of these non-optimal policies is also necessary for iteratively improving constraint satisfaction, but it is not available under existing MPC feasibility theories. This paper proposes a feasibility theory that applies to both MPC and RL by filling in the missing feasibility analysis for an arbitrary policy. The basis of our theory is to decouple policy solving and implementation into two temporal domains: the virtual-time domain and the real-time domain. This allows us to separately define initial and endless feasibility, for both states and policies, along with their corresponding feasible regions. Based on these definitions, we analyze the containment relationships between different feasible regions, which enables us to describe the feasible region of an arbitrary policy. We further provide virtual-time constraint design rules along with a practical design tool, called the feasibility function, that helps to achieve the maximum feasible region. We review most existing constraint formulations and point out that they are essentially applications of feasibility functions in different forms. We demonstrate our feasibility theory by visualizing different feasible regions under both MPC and RL policies in an emergency braking control task.
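
The distinction between satisfying a finite-horizon virtual-time constraint and staying safe forever can be illustrated with a toy emergency-braking model. The sketch below is only a loose illustration under stated assumptions, not the paper's formal definitions or its experimental setup: the state is a hypothetical pair (distance to obstacle `d`, speed `v`), the dynamics are a simple Euler discretization, the real-time constraint is `d >= 0`, and the constants `DT` and `A_MAX` and all function names are invented for this example. In this model full braking is the most favorable control for the constraint, so checking that single control sequence is enough to decide whether any safe sequence exists.

```python
# Toy emergency-braking model (all names and constants are assumptions
# made for illustration; they do not come from the paper).
DT = 0.1      # time step in seconds
A_MAX = 5.0   # maximum deceleration in m/s^2


def brake_step(d, v):
    """Advance one step under full braking; speed is clipped at zero."""
    return d - v * DT, max(v - A_MAX * DT, 0.0)


def initially_feasible(d, v, horizon):
    """True if some control keeps d >= 0 at every step of a finite horizon.

    This mimics a pointwise virtual-time constraint of length `horizon`.
    """
    if d < 0.0:
        return False
    for _ in range(horizon):
        d, v = brake_step(d, v)
        if d < 0.0:
            return False
    return True


def infinite_horizon_safe(d, v):
    """True if some control keeps d >= 0 forever (the car can stop in time)."""
    if d < 0.0:
        return False
    while v > 0.0:
        d, v = brake_step(d, v)
        if d < 0.0:
            return False
    return True


if __name__ == "__main__":
    # 5 m from the obstacle at 10 m/s: a 2-step horizon sees no violation,
    # yet the car cannot stop before the obstacle.
    print(initially_feasible(5.0, 10.0, horizon=2))
    print(infinite_horizon_safe(5.0, 10.0))
```

A state such as `(5.0, 10.0)` passes the short-horizon check but fails the infinite-horizon one, which is the kind of gap between feasible regions that the theory is meant to describe.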

Paper Structure

This paper contains 31 sections, 7 theorems, 63 equations, and 26 figures.

Key Result

Theorem 4.1

For an arbitrary state $x$, the following two statements 1) and 2) are equivalent:

Figures (26)

  • Figure 1: Illustration of core concepts in feasibility theory. The second row shows state trajectories and feasible regions of an RL policy trained for 50 steps under a Hamilton-Jacobi reachability constraint, with the first row showing the first virtual-time step corresponding to each real-time step. The remaining gray-shaded squares are stacked along the virtual-time axis. The third row shows the results after 10000 training steps. The purple-shaded areas represent real-time constraints, and the orange-shaded areas represent virtual-time constraints. The circles stand for the states at every step on a trajectory. The squares and diamonds stand for regular states, included to show the different feasible regions. Regardless of shape, all blue marks represent initially feasible states, all red marks represent endlessly feasible states, and all gray marks represent infeasible states. Intuitively, "initially feasible" means that there exists a policy satisfying the virtual-time constraint at the current step, while "endlessly feasible" means that such a policy will always exist. A state with a red cross indicates a violation of real-time or virtual-time constraints. The data in the figure come from a numerical example of emergency braking control; details can be found in the experiments section.
  • Figure 2: Real-time and virtual-time trajectories of an MPC controller in an emergency braking task. The real-time trajectories are the same across the four figures. The virtual-time trajectories start at different real-time steps. The gray-shaded area represents the constraint-violating set. The red-shaded area represents the set where there exists a policy that ensures infinite-horizon safety. A state with a red cross indicates a violation of the real-time constraint.
  • Figure 3: State trajectories and state feasibility of MPC under pointwise constraints. The circles stand for the states at every step on the trajectories, which are obtained by solving the MPC problem in a receding-horizon manner. The squares and diamonds stand for regular states, included to show the different feasible regions. Regardless of shape, all red marks represent endlessly feasible states, all blue marks represent initially feasible states, and all gray marks represent infeasible states. The purple-shaded areas represent the real-time constraint.
  • Figure 4: Illustration of containment relationships.
  • Figure 5: Feasibility function defined through control invariant set.
  • ...and 21 more figures

Theorems & Definitions (28)

  • Definition 3.1: Infeasibility
  • Definition 4.1: Initial feasibility of a state
  • Definition 4.2: Initial feasibility of a policy
  • Definition 4.3: Endless feasibility of a state
  • Definition 4.4: Endless feasibility of a policy
  • Definition 4.5
  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof
  • ...and 18 more