Shielded Deep Reinforcement Learning for Complex Spacecraft Tasking

Robert Reed; Hanspeter Schaub; Morteza Lahijanian

TL;DR

The paper tackles safe autonomous spacecraft tasking by integrating Shielded Deep Reinforcement Learning (SDRL) with formal methods. It formalizes Earth-observing tasks and safety requirements in co-safe and safe Linear Temporal Logic (LTL), derives reward signals from the co-safe LTL specification using a deterministic finite automaton (DFA) and a product MDP, and constructs a Safety MDP from which it builds three probabilistic shields (One-Step, Two-Step, Q-optimal). Empirical studies in the Basilisk simulator show that training with both liveness and safety specifications yields higher task satisfaction and far fewer safety violations than training on the liveness specification alone, and that the shields provide additional protection during deployment. The results underscore the value of combining formal specifications with learning to achieve correct-by-design behavior in high-stakes space missions, while also highlighting conservatism in the safety abstractions and opportunities for tighter guarantees. Overall, the approach offers a scalable pathway to safer autonomous spacecraft operation, balancing task performance against explicit probabilistic safety guarantees.
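To make the reward construction concrete, the following is a minimal sketch of how a DFA for a co-safe LTL formula can drive a reward signal on a product of environment and automaton state. It is an illustration under stated assumptions, not the paper's construction: the formula F(imaged & F downlinked), the proposition names, the `SequenceDFA` class, and the choice of a unit reward on first reaching an accepting state are all hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical atomic propositions; the names are illustrative only.
IMAGED = "imaged"
DOWNLINKED = "downlinked"


class SequenceDFA:
    """Hand-built DFA for the co-safe formula F(imaged & F downlinked).

    State 0: nothing achieved yet.
    State 1: the target has been imaged.
    State 2: imaged, then downlinked (accepting and absorbing).
    """

    def __init__(self) -> None:
        self.initial: int = 0
        self.accepting: Set[int] = {2}

    def step(self, q: int, label: FrozenSet[str]) -> int:
        # Both checks run in order so that a label containing both
        # propositions moves 0 -> 1 -> 2 in a single step.
        if q == 0 and IMAGED in label:
            q = 1
        if q == 1 and DOWNLINKED in label:
            q = 2
        return q


def product_reward(q: int, q_next: int, dfa: SequenceDFA) -> float:
    """Assumed reward: 1 on the transition that first reaches acceptance."""
    entered = q not in dfa.accepting and q_next in dfa.accepting
    return 1.0 if entered else 0.0


def run(labels: Iterable[Iterable[str]]) -> Tuple[int, float]:
    """Track the DFA component of the product state along a label trace."""
    dfa = SequenceDFA()
    q = dfa.initial
    total = 0.0
    for label in labels:
        q_next = dfa.step(q, frozenset(label))
        total += product_reward(q, q_next, dfa)
        q = q_next
    return q, total
```

In a training loop, the DFA state would be appended to the agent's observation so that the policy can condition on task progress; how the paper shapes or scales the reward is not specified in this summary.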

Abstract

Autonomous spacecraft control via Shielded Deep Reinforcement Learning (SDRL) has become a rapidly growing research area. However, the construction of shields and the definition of tasking remain informal, resulting in policies with no guarantees on safety and ambiguous goals for the RL agent. In this paper, we first explore the use of formal languages, namely Linear Temporal Logic (LTL), to formalize spacecraft tasks and safety requirements. We then define a method for automatically constructing a reward function from a co-safe LTL specification for effective training in an SDRL framework. We also investigate methods for constructing a shield from a safe LTL specification for spacecraft applications and propose three designs that provide probabilistic guarantees. Through several experiments, we show how these shields interact with different policies and demonstrate the flexibility of the reward structure.
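As a rough illustration of what a probabilistic shield does, the sketch below filters a proposed action using a one-step risk bound on a small finite safety abstraction. Everything specific here is an assumption made for the example: the toy states and actions, the transition probabilities, the threshold `delta`, and the fallback rule of substituting the lowest-risk action. The paper's three shield definitions are not reproduced in this summary.

```python
from typing import Dict, Set

# state -> action -> next_state -> probability (hypothetical toy abstraction)
Transitions = Dict[str, Dict[str, Dict[str, float]]]

P: Transitions = {
    "nominal": {
        "image": {"nominal": 0.8, "high_rw": 0.2},
        "desat": {"nominal": 1.0},
    },
    "high_rw": {
        "image": {"high_rw": 0.5, "saturated": 0.5},
        "desat": {"nominal": 0.9, "high_rw": 0.1},
    },
}
UNSAFE: Set[str] = {"saturated"}  # e.g. reaction wheels saturated


def one_step_risk(P: Transitions, unsafe: Set[str],
                  state: str, action: str) -> float:
    """Probability that the next abstract state is unsafe."""
    return sum(p for s_next, p in P[state][action].items() if s_next in unsafe)


def shield(P: Transitions, unsafe: Set[str],
           state: str, proposed: str, delta: float) -> str:
    """Pass the proposed action through if its one-step risk is at most delta.

    Otherwise substitute the action with the lowest one-step risk
    (an assumed fallback rule for this sketch).
    """
    if one_step_risk(P, unsafe, state, proposed) <= delta:
        return proposed
    return min(P[state], key=lambda a: one_step_risk(P, unsafe, state, a))
```

For example, `shield(P, UNSAFE, "high_rw", "image", 0.05)` overrides the imaging action with `"desat"`, while the same proposal in `"nominal"` passes through unchanged. This mirrors a post-posed arrangement in which the shield sits between the policy's output and the environment.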

Paper Structure (18 sections, 16 equations, 2 figures, 2 tables)

Introduction
Problem Formulation
Spacecraft Model
LTL for Earth Observing Tasks and Safety Requirements
Problem Statement
Reward and Shield Design for DRL
Rewards for DRL with LTL Specifications
Shield Design
One-Step Safety
Two-Step Safety
Q-optimal Safety
Case Studies
Simple Task: Importance of $\varphi_S$ in Training
Complex Tasks and Shielding
Trained and Deployed without Shielding
...and 3 more sections

Figures (2)

Figure 1: Post-Posed Shielded RL architecture.
Figure 2: Action history and reaction wheel speeds when deploying under policy $\pi_0$ (top) and $\pi_1$ (bottom) from a fixed initial condition. The red highlight shows when the spacecraft has access to the target; the blue highlights show when the spacecraft is in Momentum Dumping (RW Desat) Mode. Note that policy $\pi_1$ (trained on $\varphi_{0L} \land \varphi_S$) keeps the spacecraft safe after imaging the target, whereas policy $\pi_0$ (trained on only $\varphi_{0L}$) prioritizes imaging over spacecraft survival.

Theorems & Definitions (12)

Definition 1: MDP
Example 1
Definition 2: Policy
Definition 3: Co-safe LTL
Definition 4: Safe LTL
Example 2
Remark 1
Definition 5: DFA
Definition 6: Product MDP
Definition 7: Safety MDP
...and 2 more
