Rocket Landing Control with Random Annealing Jump Start Reinforcement Learning
Yuxuan Jiang, Yujie Yang, Zhiqian Lan, Guojian Zhan, Shengbo Eben Li, Qi Sun, Jian Ma, Tianwen Yu, Changwu Zhang
TL;DR
This work tackles goal-oriented rocket landing control under real-time constraints and sparse rewards by introducing Random Annealing Jump Start (RAJS), a framework that combines a fixed guide policy with a learnable exploration policy and anneals the guide horizon upper bound to zero to minimize distribution shift. Implemented on top of Proximal Policy Optimization (PPO), RAJS dramatically improves landing success from 8% to 97% on a high-fidelity rocket model, with additional enhancements such as cascading jump start, refined reward design, early termination, and action smoothing enabling smoother and more realistic control. The approach is validated with extensive simulation and Hardware-in-the-Loop testing, demonstrating real-time feasibility with 10 ms control intervals and robust performance under wind disturbances. The results highlight RAJS as a practical method to enable safe, efficient RL for critical aerospace control tasks and point to future work in integrating safe RL techniques to further ensure pose stability and constraint satisfaction.
Abstract
Rocket recycling is a crucial pursuit in aerospace technology, aimed at reducing costs and environmental impact in space exploration. The primary focus centers on rocket landing control, which involves guiding a nonlinear underactuated rocket with limited fuel in real time. This challenging task prompts the application of reinforcement learning (RL), yet the goal-oriented nature of the problem poses difficulties for standard RL algorithms due to the absence of intermediate reward signals. This paper, for the first time, significantly elevates the success rate of rocket landing control from 8% with a baseline controller to 97% on a high-fidelity rocket model using RL. Our approach, called Random Annealing Jump Start (RAJS), is tailored for real-world goal-oriented problems by leveraging prior feedback controllers as a guide policy to facilitate environmental exploration and policy learning in RL. In each episode, the guide policy navigates the environment for the guide horizon, after which the exploration policy takes charge to complete the remaining steps. This jump-start strategy prunes the exploration space, rendering the problem more tractable to RL algorithms. The guide horizon is sampled from a uniform distribution, with its upper bound annealing to zero based on performance metrics, mitigating the distribution shift and mismatch issues in existing methods. Additional enhancements, including cascading jump start, refined reward and terminal condition, and action smoothness regulation, further improve policy performance and practical applicability. The proposed method is validated through extensive evaluation and Hardware-in-the-Loop testing, affirming the effectiveness, real-time feasibility, and smoothness of the proposed controller.
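The jump-start mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy 1-D task, the function names, the success-rate threshold, and the annealing step size are all assumptions made for illustration, and the policy update (PPO in the paper) is omitted.

```python
import random
from typing import Callable, List, Tuple

State = int
Action = int
Policy = Callable[[State], Action]


def sample_guide_horizon(upper_bound: int, rng: random.Random) -> int:
    """Draw the guide horizon uniformly from {0, ..., upper_bound}."""
    return rng.randint(0, upper_bound)


def anneal_upper_bound(upper_bound: int, success_rate: float,
                       threshold: float = 0.8, step: int = 1) -> int:
    """Lower the upper bound once the metric clears a threshold.

    The threshold rule and step size are assumed; the abstract only states
    that annealing is based on performance metrics.
    """
    if success_rate >= threshold:
        return max(0, upper_bound - step)
    return upper_bound


def rollout(guide: Policy, explorer: Policy, guide_horizon: int,
            start: State = 10, max_steps: int = 30) -> Tuple[bool, List[str]]:
    """Toy task: reach position 0. The guide acts first, then the explorer."""
    state, actors = start, []
    for t in range(max_steps):
        use_guide = t < guide_horizon
        action = guide(state) if use_guide else explorer(state)
        actors.append("guide" if use_guide else "explorer")
        state += action
        if state <= 0:
            return True, actors  # sparse success signal, given only at the goal
    return False, actors


rng = random.Random(0)
guide = lambda s: -1                      # toy stand-in for a prior feedback controller
explorer = lambda s: rng.choice([-1, 1])  # placeholder for the learnable policy

upper_bound = 10
history = []
for iteration in range(50):
    results = []
    for _ in range(20):
        h = sample_guide_horizon(upper_bound, rng)
        success, _ = rollout(guide, explorer, h)
        results.append(success)
    success_rate = sum(results) / len(results)
    # A real implementation would update the exploration policy here.
    upper_bound = anneal_upper_bound(upper_bound, success_rate, threshold=0.3)
    history.append(upper_bound)
```

Because the placeholder explorer never learns, the upper bound in this sketch may stall above zero. In the setting the abstract describes, the exploration policy improves during training, so the bound can anneal all the way to zero and the final policy runs without the guide.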
