Safety-Critical Reinforcement Learning with Viability-Based Action Shielding for Hypersonic Longitudinal Flight
Hossein Rastgoftar
TL;DR
This work tackles safety-critical reinforcement learning for a nonlinear hypersonic longitudinal flight model with continuous state and control spaces, enforcing hard constraints via a viability-based action shield. It presents a hybrid MDP framework that combines offline reachability analysis to construct a state-feasible set $\mathcal{S}_{\text{feas}}$ and a mode-dependent reward structure anchored by a safety box $\mathcal{B}_{\text{safe}}$, ensuring forward invariance through a state-dependent admissible-action mask $\mathcal{A}_{\text{mask}}(s)$. Learning is performed with mask-consistent tabular Q-learning, using a local action neighborhood to maintain smooth control while policies are constrained to admissible actions; episode chaining propagates terminal states to expose long-horizon recovery dynamics without violating feasibility. The approach is demonstrated on a high-fidelity hypersonic model with coupling between aerodynamics and propulsion, achieving recovery from unsafe conditions and sustained operation within hard constraints, while preserving interpretability and data efficiency. Overall, the work integrates reachability, shielding, and hybrid MDP concepts to provide formal safety guarantees for RL in aerospace control applications.
Abstract
This paper presents a safety-critical reinforcement learning framework for nonlinear dynamical systems with continuous state and input spaces operating under explicit physical constraints. Hard safety constraints are enforced independently of the reward through action shielding and reachability-based admissible action sets, ensuring that unsafe behaviors are never intentionally selected during learning or execution. To capture nominal operation and recovery behavior within a single control architecture, the state space is partitioned into safe and unsafe regions based on membership in a safety box, and a mode-dependent reward is used to promote accurate tracking inside the safe region and recovery toward it when operating outside. To enable online tabular learning on continuous dynamics, a finite-state abstraction is constructed via state aggregation, and action selection and value updates are consistently restricted to admissible actions. The framework is demonstrated on a longitudinal point-mass hypersonic vehicle model with aerodynamic and propulsion couplings, using angle of attack and throttle as control inputs.
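The interplay of state aggregation, admissible-action masking, and the mode-dependent reward can be illustrated with a minimal sketch. It is not the paper's implementation: the 1-D unstable toy dynamics, the interval bounds, the bin count, the reward shapes, and every name below (`step`, `build_mask`, `train`, and so on) are assumptions chosen for illustration. The mask here comes from a simple one-step check on cell endpoints rather than from the offline reachability analysis, and the local action neighborhood and episode chaining are omitted.

```python
import numpy as np

# Hypothetical 1-D toy system; all constants are illustrative assumptions.
X_MIN, X_MAX = -2.0, 2.0      # stand-in for the feasible set S_feas
BOX_LO, BOX_HI = -0.5, 0.5    # stand-in for the safety box B_safe
DT = 0.2
ACTIONS = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
N_BINS = 40
EDGES = np.linspace(X_MIN, X_MAX, N_BINS + 1)


def step(x, u):
    """Unstable scalar dynamics: the state drifts away from the origin."""
    return x + DT * (0.5 * x + u)


def discretize(x):
    """State aggregation: map a continuous state to its cell index."""
    return int(np.clip(np.searchsorted(EDGES, x, side="right") - 1, 0, N_BINS - 1))


def build_mask():
    """An action is admissible in a cell if both cell endpoints stay feasible.

    The endpoint check suffices here because step() is increasing in x.
    """
    mask = np.zeros((N_BINS, len(ACTIONS)), dtype=bool)
    for i in range(N_BINS):
        for j, u in enumerate(ACTIONS):
            lo, hi = step(EDGES[i], u), step(EDGES[i + 1], u)
            mask[i, j] = (X_MIN <= lo) and (hi <= X_MAX)
    return mask


def reward(x, x_ref=0.0):
    """Mode-dependent reward: tracking inside the box, recovery outside."""
    if BOX_LO <= x <= BOX_HI:
        return -abs(x - x_ref)
    dist = max(BOX_LO - x, x - BOX_HI)
    return -1.0 - dist


def train(episodes=300, horizon=50, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """Tabular Q-learning with selection and bootstrapping both masked."""
    rng = np.random.default_rng(seed)
    mask = build_mask()
    Q = np.zeros((N_BINS, len(ACTIONS)))
    visited = []
    for _ in range(episodes):
        x = rng.uniform(X_MIN, X_MAX)
        for _ in range(horizon):
            s = discretize(x)
            allowed = np.flatnonzero(mask[s])
            if rng.random() < eps:
                a = rng.choice(allowed)                   # explore among admissible
            else:
                a = allowed[np.argmax(Q[s, allowed])]     # exploit among admissible
            x_next = step(x, ACTIONS[a])
            s_next = discretize(x_next)
            # The bootstrap target also maximizes over admissible actions only.
            target = reward(x_next) + gamma * np.max(Q[s_next, mask[s_next]])
            Q[s, a] += alpha * (target - Q[s, a])
            visited.append(x_next)
            x = x_next
    return Q, mask, np.array(visited)


def greedy_rollout(Q, mask, x0, steps=60):
    """Roll out the masked greedy policy from x0."""
    xs = [x0]
    for _ in range(steps):
        s = discretize(xs[-1])
        allowed = np.flatnonzero(mask[s])
        a = allowed[np.argmax(Q[s, allowed])]
        xs.append(step(xs[-1], ACTIONS[a]))
    return np.array(xs)


if __name__ == "__main__":
    Q, mask, visited = train()
    print("all visited states feasible:",
          bool(visited.min() >= X_MIN and visited.max() <= X_MAX))
```

Because exploration, greedy selection, and the bootstrap maximum all draw from the same admissible set, feasibility in this sketch does not depend on the reward or on how well the values have converged. The reward only shapes which admissible action is preferred.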
