Shielded Deep Reinforcement Learning for Complex Spacecraft Tasking

Robert Reed, Hanspeter Schaub, Morteza Lahijanian

TL;DR

The paper addresses safe autonomous spacecraft tasking by combining Shielded Deep Reinforcement Learning (SDRL) with formal methods. It expresses Earth-observing tasks as co-safe Linear Temporal Logic (LTL) specifications and safety requirements as safe LTL specifications. Reward signals are derived from the co-safe specification through a deterministic finite automaton (DFA) and a product MDP, and a Safety MDP is constructed to build three probabilistic shields (One-Step, Two-Step, Q-optimal). Experiments in the Basilisk simulator indicate that training on both the liveness and the safety specification yields higher task satisfaction and far fewer safety violations than training on the liveness specification alone, and that the shields add protection during deployment. The results support combining formal specifications with learning to obtain correct-by-design behavior in high-stakes space missions. The safety abstraction is conservative, however, which leaves room for tighter guarantees. Overall, the approach offers a scalable path toward autonomous spacecraft operation that balances task performance with explicit, probabilistic safety guarantees.
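As a rough illustration of how a reward can be read off a DFA, the sketch below tracks a hand-written automaton for a hypothetical co-safe specification, F(image ∧ F downlink), and pays a reward of 1 on the transition that first reaches an accepting state. The specification, the proposition names `image` and `downlink`, and the sparse reward are assumptions made for this example; the paper's own reward construction may differ.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical co-safe spec: F(image & F downlink).
# DFA states: 0 = nothing done, 1 = imaged, 2 = accepting (absorbing).
ACCEPTING = {2}


def dfa_step(q: int, labels: FrozenSet[str]) -> int:
    """Advance the DFA on the atomic propositions that hold at this step."""
    if q == 0 and "image" in labels:
        q = 1
    # Checked after the update above, so imaging and downlinking
    # in the same step also satisfies the specification.
    if q == 1 and "downlink" in labels:
        q = 2
    return q


def product_reward(q: int, q_next: int) -> float:
    """Assumed sparse reward: 1 when an accepting state is first reached."""
    return 1.0 if q_next in ACCEPTING and q not in ACCEPTING else 0.0


def run(trace: Iterable[Set[str]]) -> Tuple[int, float]:
    """Replay a label trace; return the final DFA state and total reward."""
    q, total = 0, 0.0
    for labels in trace:
        q_next = dfa_step(q, frozenset(labels))
        total += product_reward(q, q_next)
        q = q_next
    return q, total
```

In a product MDP the agent would observe the pair (environment state, DFA state), so the DFA state acts as a memory of task progress and the reward depends only on the current product transition.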

Abstract

Autonomous spacecraft control via Shielded Deep Reinforcement Learning (SDRL) has become a rapidly growing research area. However, the construction of shields and the definition of tasking remain informal, resulting in policies with no guarantees on safety and ambiguous goals for the RL agent. In this paper, we first explore the use of formal languages, namely Linear Temporal Logic (LTL), to formalize spacecraft tasks and safety requirements. We then define a way to automatically construct a reward function from a co-safe LTL specification for effective training in the SDRL framework. We also investigate methods for constructing a shield from a safe LTL specification for spacecraft applications and propose three designs that provide probabilistic guarantees. Through several experiments, we show how these shields interact with different policies and demonstrate the flexibility of the reward structure.
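To make the idea of a probabilistic shield concrete, here is a minimal sketch of a one-step check placed after the policy: the proposed action is passed through only if its probability of reaching an unsafe state in one step is at most a threshold `delta`; otherwise the shield substitutes the action with the lowest such probability. The abstract states, actions, transition probabilities, and threshold are invented for illustration and are not taken from the paper, and the paper's One-Step, Two-Step, and Q-optimal designs may be defined differently.

```python
from typing import Dict, Tuple

# Hypothetical Safety MDP: P[(state, action)] maps next states to probabilities.
P: Dict[Tuple[str, str], Dict[str, float]] = {
    ("wheels_ok", "image"): {"wheels_ok": 0.9, "wheels_high": 0.1},
    ("wheels_ok", "desat"): {"wheels_ok": 1.0},
    ("wheels_high", "image"): {"wheels_high": 0.6, "unsafe": 0.4},
    ("wheels_high", "desat"): {"wheels_ok": 0.95, "wheels_high": 0.05},
}
UNSAFE = {"unsafe"}
ACTIONS = ("image", "desat")


def unsafe_prob(state: str, action: str) -> float:
    """Probability of landing in an unsafe state after one step."""
    return sum(p for s, p in P[(state, action)].items() if s in UNSAFE)


def shield(state: str, proposed: str, delta: float = 0.05) -> str:
    """Keep the proposed action if its one-step risk is within delta;
    otherwise fall back to the action with the lowest one-step risk."""
    if unsafe_prob(state, proposed) <= delta:
        return proposed
    return min(ACTIONS, key=lambda a: unsafe_prob(state, a))
```

With these made-up numbers, `shield("wheels_high", "image")` overrides the imaging request with `"desat"`, while `shield("wheels_ok", "image")` lets the policy's action through unchanged.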

Paper Structure (18 sections, 16 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Post-Posed Shielded RL architecture.
  • Figure 2: Action history and reaction wheel speeds when deploying under policy $\pi_0$ (top) and $\pi_1$ (bottom) from a fixed initial condition. The red highlight shows when the spacecraft has access to the target; the blue highlights show when the spacecraft is in Momentum Dumping (RW Desat) Mode. Note that policy $\pi_1$ (trained on $\varphi_{0L} \land \varphi_S$) keeps the spacecraft safe after imaging the target, whereas policy $\pi_0$ (trained on only $\varphi_{0L}$) prioritizes imaging over spacecraft survival.

Theorems & Definitions (12)

  • Definition 1: MDP
  • Example 1
  • Definition 2: Policy
  • Definition 3: Co-safe LTL
  • Definition 4: Safe LTL
  • Example 2
  • Remark 1
  • Definition 5: DFA
  • Definition 6: Product MDP
  • Definition 7: Safety MDP
  • ...and 2 more