Safety-Critical Reinforcement Learning with Viability-Based Action Shielding for Hypersonic Longitudinal Flight

Hossein Rastgoftar

TL;DR

This work tackles safety-critical reinforcement learning for a nonlinear hypersonic longitudinal flight model with continuous state and control spaces, enforcing hard constraints via a viability-based action shield. It presents a hybrid MDP framework that combines offline reachability analysis to construct a state-feasible set $\mathcal{S}_{\text{feas}}$ and a mode-dependent reward structure anchored by a safety box $\mathcal{B}_{\text{safe}}$, ensuring forward invariance through a state-dependent admissible-action mask $\mathcal{A}_{\text{mask}}(s)$. Learning is performed with mask-consistent tabular Q-learning, using a local action neighborhood to maintain smooth control while policies are constrained to admissible actions; episode chaining propagates terminal states to expose long-horizon recovery dynamics without violating feasibility. The approach is demonstrated on a high-fidelity hypersonic model with coupling between aerodynamics and propulsion, achieving recovery from unsafe conditions and sustained operation within hard constraints, while preserving interpretability and data efficiency. Overall, the work integrates reachability, shielding, and hybrid MDP concepts to provide formal safety guarantees for RL in aerospace control applications.

Abstract

This paper presents a safety-critical reinforcement learning framework for nonlinear dynamical systems with continuous state and input spaces operating under explicit physical constraints. Hard safety constraints are enforced independently of the reward through action shielding and reachability-based admissible action sets, ensuring that unsafe behaviors are never intentionally selected during learning or execution. To capture nominal operation and recovery behavior within a single control architecture, the state space is partitioned into safe and unsafe regions based on membership in a safety box, and a mode-dependent reward is used to promote accurate tracking inside the safe region and recovery toward it when operating outside. To enable online tabular learning on continuous dynamics, a finite-state abstraction is constructed via state aggregation, and action selection and value updates are consistently restricted to admissible actions. The framework is demonstrated on a longitudinal point-mass hypersonic vehicle model with aerodynamic and propulsion couplings, using angle of attack and throttle as control inputs.
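As a rough illustration of what "consistently restricted to admissible actions" could mean for tabular learning, the sketch below masks both the action choice and the bootstrap maximum in a Q-learning update. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the boolean `mask` array standing in for the admissible action set, the treatment of states without admissible actions as terminal, and the values of `alpha`, `gamma`, and `epsilon` are all hypothetical.

```python
import numpy as np

def masked_epsilon_greedy(Q, mask, s, epsilon, rng):
    """Pick an action at aggregated state s from admissible actions only.

    Q    : (n_states, n_actions) value table
    mask : (n_states, n_actions) boolean array; True marks an admissible action
           (hypothetical stand-in for the admissible action set)
    """
    admissible = np.flatnonzero(mask[s])
    if admissible.size == 0:
        raise ValueError("no admissible action at this state")
    if rng.random() < epsilon:
        # Explore, but never outside the mask.
        return int(rng.choice(admissible))
    # Exploit: greedy over admissible actions only.
    return int(admissible[np.argmax(Q[s, admissible])])

def masked_q_update(Q, mask, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning update whose bootstrap max ranges over admissible actions."""
    admissible_next = np.flatnonzero(mask[s_next])
    # Assumption: a successor with no admissible action is treated as terminal.
    bootstrap = Q[s_next, admissible_next].max() if admissible_next.size else 0.0
    Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
    return Q
```

Masking the bootstrap term as well as the action choice keeps inadmissible actions, whose table entries are never updated from real transitions, from leaking into the value targets.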

Paper Structure (14 sections, 56 equations, 3 figures, 5 tables)

Introduction
Related Work
Contributions
Outline
Problem Statement
Hypersonic Vehicle Model
Safety and Operational Constraints
Hybrid Control Framework
Dynamics Discretization, State Abstraction, and Reward Design
RL with Shielding and Episode Chaining
Simulation Environment and Conditions
Reward Structure
Plots
Conclusion and Future Work

Figures (3)

Figure 1: Contours for (a) drag coefficient, (b) lift coefficient, (c) specific impulse, and (d) maximum thrust.
Figure 2: State trajectories starting from an initial condition outside the safety box. Altitude $h$, speed $V$, and flight-path angle $\gamma$ are shown. Red dashed lines indicate the safety-box bounds. The trajectory is driven back into the safe region and converges toward the nominal cruise condition while remaining within hard physical constraints.
Figure 3: Control inputs and reward evolution. Angle-of-attack and throttle commands remain within the admissible discrete action set. The reward increases as the state approaches the safety box and stabilizes during nominal cruise operation.

Theorems & Definitions (4)

Definition 1: One-step hard-safe action
Definition 2: State aggregation and representative states
Definition 3: Viable feasible set
Definition 4: Admissible action set
