Shielded Deep Reinforcement Learning for Complex Spacecraft Tasking
Robert Reed, Hanspeter Schaub, Morteza Lahijanian
TL;DR
The paper tackles safe autonomous spacecraft tasking by combining Shielded Deep Reinforcement Learning (SDRL) with formal methods. It formalizes Earth-observing tasks and safety requirements as co-safe and safe Linear Temporal Logic (LTL) specifications, derives reward signals from the co-safe specification via a deterministic finite automaton (DFA) and a product MDP, and constructs a Safety MDP from which three probabilistic shields (One-Step, Two-Step, Q-optimal) are built. Experiments in Basilisk show that training with both liveness and safety specifications yields higher task satisfaction and far fewer safety violations, and that the shields continue to protect the agent during deployment. The results support combining formal specifications with learning to obtain correct-by-design behavior in high-stakes space missions, while also exposing conservatism in the safety abstractions and room for tighter guarantees. Overall, the approach offers a scalable path toward safer autonomous spacecraft operation, balancing task performance against explicit probabilistic safety guarantees.
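To make the DFA-and-product-MDP idea concrete, here is a minimal sketch of how a reward could be read off a co-safe LTL specification. Everything in it is an assumption for illustration rather than the paper's construction: the example formula F(image & F downlink), the label names `image` and `downlink`, the hand-written three-state DFA, and the choice of a sparse reward of 1.0 on first entering an accepting state.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical DFA for the co-safe formula F(image & F downlink):
#   0 = nothing achieved, 1 = target imaged, 2 = imaged then downlinked.
ACCEPTING: Set[int] = {2}


def dfa_step(q: int, labels: FrozenSet[str]) -> int:
    """Advance the DFA on the set of atomic propositions true in this step."""
    if q == 0 and "image" in labels:
        q = 1
    if q == 1 and "downlink" in labels:
        q = 2
    return q  # state 2 is absorbing: a co-safe task stays satisfied


def product_step(q: int, labels: FrozenSet[str]) -> Tuple[int, float]:
    """DFA component of one product-MDP step, plus the derived reward.

    Assumed reward: 1.0 the first time the DFA becomes accepting, else 0.0.
    """
    q_next = dfa_step(q, labels)
    reward = 1.0 if q_next in ACCEPTING and q not in ACCEPTING else 0.0
    return q_next, reward


def run(trace: Iterable[Set[str]]) -> Tuple[int, float]:
    """Replay a sequence of label sets; return final DFA state and return."""
    q, total = 0, 0.0
    for labels in trace:
        q, r = product_step(q, frozenset(labels))
        total += r
    return q, total


# Imaging before downlinking satisfies the task; the reverse order does not.
in_order = run([set(), {"image"}, set(), {"downlink"}])
reversed_order = run([{"downlink"}, {"image"}])
```

In a full product MDP the agent's state would be the pair (environment state, DFA state), so the policy can condition on how much of the task is already complete. Only the DFA component is tracked here.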
Abstract
Autonomous spacecraft control via Shielded Deep Reinforcement Learning (SDRL) has become a rapidly growing research area. However, the construction of shields and the definition of tasking remain informal, resulting in policies with no guarantees on safety and ambiguous goals for the RL agent. In this paper, we first explore the use of formal languages, namely Linear Temporal Logic (LTL), to formalize spacecraft tasks and safety requirements. We then define a method to automatically construct a reward function from a co-safe LTL specification for effective training in an SDRL framework. We also investigate methods for constructing a shield from a safe LTL specification for spacecraft applications and propose three designs that provide probabilistic guarantees. Through several experiments, we show how these shields interact with different policies and demonstrate the flexibility of the reward structure.
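As a rough illustration of a probabilistic shield, the sketch below filters a policy's ranked actions by the probability of entering an unsafe state within one step. It is a generic one-step lookahead under stated assumptions, not the paper's shield designs: the tabular transition model, the state and action names, the threshold `delta`, and the fallback to the least risky action are all hypothetical.

```python
from typing import Dict, List, Set

# Assumed tabular safety model: state -> action -> next_state -> probability.
Transitions = Dict[str, Dict[str, Dict[str, float]]]


def one_step_risk(P: Transitions, unsafe: Set[str], state: str, action: str) -> float:
    """Probability that taking `action` in `state` lands in an unsafe state."""
    return sum(p for s_next, p in P[state][action].items() if s_next in unsafe)


def shield(P: Transitions, unsafe: Set[str], state: str,
           ranked_actions: List[str], delta: float) -> str:
    """Return the policy's most preferred action whose one-step risk is <= delta.

    If no action meets the threshold, fall back to the least risky one
    (an assumed design choice for this sketch).
    """
    for action in ranked_actions:
        if one_step_risk(P, unsafe, state, action) <= delta:
            return action
    return min(ranked_actions, key=lambda a: one_step_risk(P, unsafe, state, a))


# Hypothetical example: a spacecraft with a low battery.
P: Transitions = {
    "low_power": {
        "image": {"dead": 0.3, "low_power": 0.7},
        "charge": {"nominal": 0.95, "dead": 0.05},
    }
}
unsafe_states = {"dead"}
policy_ranking = ["image", "charge"]  # the learned policy prefers imaging

strict = shield(P, unsafe_states, "low_power", policy_ranking, delta=0.1)
lenient = shield(P, unsafe_states, "low_power", policy_ranking, delta=0.5)
```

With the strict threshold the shield overrides the policy's preferred `image` action in favor of `charge`. With the lenient threshold the preferred action passes through unchanged.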
