Revisiting Space Mission Planning: A Reinforcement Learning-Guided Approach for Multi-Debris Rendezvous

Agni Bandyopadhyay, Guenther Waxenegger-Wilfing

Abstract

This research introduces a novel application of a masked Proximal Policy Optimization (PPO) algorithm, from the field of deep reinforcement learning (RL), to determine the most efficient sequence of space debris visitation, using Izzo's adaptation of the Lambert solver for each individual rendezvous. The aim is to optimize the order in which all given debris are visited so as to minimize the total rendezvous time for the entire mission. A neural network (NN) policy is developed and trained on simulated space missions with varying debris fields. After training, the neural network computes approximately optimal paths using Izzo's adaptation of Lambert maneuvers. Performance is evaluated against standard heuristics in mission planning. The reinforcement learning approach demonstrates a significant improvement in planning efficiency, reducing the total mission time by an average of approximately 10.96% and 13.66% compared to the Genetic and Greedy algorithms, respectively. On average, the model identifies the most time-efficient sequence for debris visitation across the simulated scenarios, and it does so with the fastest computation time among the compared methods. This approach is a step forward in mission planning strategies for space debris clearance.
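The masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the logits, and the static transfer-time table below are hypothetical, and a real planner would obtain each leg's duration from a Lambert solver at the current epoch rather than from a fixed matrix. The first function shows how an action mask forces already-visited debris to zero probability; the second is a generic nearest-neighbour greedy baseline, which may differ from the Greedy algorithm used in the paper.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over logits; entries with mask[i] == False get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    peak = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_sequence(transfer_time, start=0):
    """Greedy baseline: always go to the unvisited debris with the shortest leg."""
    n = len(transfer_time)
    visited = [False] * n
    visited[start] = True
    order, current, total = [start], start, 0.0
    for _ in range(n - 1):
        candidates = [j for j in range(n) if not visited[j]]
        nxt = min(candidates, key=lambda j: transfer_time[current][j])
        total += transfer_time[current][nxt]
        visited[nxt] = True
        order.append(nxt)
        current = nxt
    return order, total


# Hypothetical, time-independent transfer durations between four debris objects.
# Assumption: real durations change with epoch because the debris keep moving.
TRANSFER_TIME = [
    [0.0, 2.0, 9.0, 4.0],
    [2.0, 0.0, 3.0, 8.0],
    [9.0, 3.0, 0.0, 1.0],
    [4.0, 8.0, 1.0, 0.0],
]

# Debris 0 and 3 are already visited, so they are masked out.
probs = masked_softmax([1.0, 2.0, 0.5, 2.0], [False, True, True, False])
order, total = greedy_sequence(TRANSFER_TIME)
```

In this sketch the mask guarantees that a sampled action never revisits a debris object, so every completed episode yields a valid visitation sequence whose total time can be compared against the baseline.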

Paper Structure (20 sections, 3 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: An example problem in which two debris rendezvous are conducted. The spacecraft uses $X_I$ to rendezvous with the first debris. $X_{II}$ represents the two impulses applied to the spacecraft at the same instant: one to rendezvous with Debris 1 and the other to start the next rendezvous maneuver, toward Debris 2. $X_{III}$ is the retardation impulse applied to the spacecraft to rendezvous with Debris 2.
  • Figure 2: A classical two-impulse debris rendezvous maneuver. It comprises two impulses: one to enter the transfer orbit ($X_I$) and the other to complete the rendezvous and stay in orbit with the debris ($X_{II}$). The point of rendezvous is represented by a multi-coloured orb, because at rendezvous the spacecraft/satellite and the debris occupy the same position in the two-dimensional representation.
  • Figure 3: Flowchart of a typical RL algorithm [sharma2020reinforcement]. It shows how an RL agent learns over time how its actions affect the environment and adapts to maximize the reward.
  • Figure 4: Predicted total time to rendezvous (TTR) for all algorithms across all test cases.
  • Figure 5: Cumulative reward per episode for the Masked PPO algorithm (from TensorBoard log data). The goal is for this reward to increase over time while converging toward a deterministic value by the end of training.
  • ...and 1 more figures