Sensing for Space Safety and Sustainability: A Deep Learning Approach with Vision Transformers
Wenxuan Zhang, Peng Hu
TL;DR
This work tackles onboard satellite object detection (SOD) for space safety in low Earth orbit by introducing GELAN-ViT and GELAN-RepViT, two detectors that separate CNN-based local feature extraction from Vision Transformer (ViT)-based global context. Built on the GELAN architecture, the models run the CNN and ViT paths in parallel and fuse their outputs. On the SOD dataset (SODD) and VOC 2012, they outperform YOLOv9-t in mAP, reaching around 95% mAP50 on SODD and at least 60.7% mAP50 on VOC 2012, with GFLOPs reduced by over 5.0 and 5.2, respectively. The lower computational cost supports responsive, power-efficient operation on small satellites. Separating the local and global feature pathways is thus a practical strategy for robust SOD in space, with implications for collision risk assessment and long-term space sustainability.
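To make the idea of separate local and global paths concrete, the following is a minimal toy sketch in plain Python, not the authors' GELAN-ViT or GELAN-RepViT implementation. It assumes a 1-D sequence of scalar features; the function names (`local_path`, `global_path`, `fuse`), the smoothing kernel, and the use of concatenation as the fusion step are illustrative assumptions only.

```python
import math

def local_path(x, kernel=(0.25, 0.5, 0.25)):
    """CNN-like branch (assumed toy form): 1-D convolution with zero padding.
    Each output depends only on a small neighbourhood of the input."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]

def global_path(x):
    """ViT-like branch (assumed toy form): single-head self-attention over
    scalar tokens with identity projections. Each output mixes all inputs."""
    out = []
    for q in x:
        scores = [q * k for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out

def fuse(local, global_):
    """Fusion step (assumed here to be concatenation): pair the two branch
    outputs position by position."""
    return [(l, g) for l, g in zip(local, global_)]

features = [0, 0, 1, 0, 0]
fused = fuse(local_path(features), global_path(features))
```

In this toy setting the local branch only spreads the central value to its immediate neighbours, whereas the global branch gives every position a value that depends on the whole sequence; the fused output carries both views.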
Abstract
The rapid increase of space assets, represented by small satellites in low Earth orbit, can enable ubiquitous digital services for everyone. However, in the dynamic space environment, numerous space objects, complex atmospheric conditions, and unexpected events can easily create adverse conditions that affect space safety, operations, and the sustainability of the outer space environment. This challenge calls for responsive, effective satellite object detection (SOD) solutions that allow a small satellite to assess and respond to collision risks within the constrained resources of a small satellite platform. This paper discusses SOD tasks and an onboard deep learning (DL) approach to them. Two new DL models are proposed, called GELAN-ViT and GELAN-RepViT, which incorporate the vision transformer (ViT) into the Generalized Efficient Layer Aggregation Network (GELAN) architecture and address limitations by separating the convolutional neural network (CNN) and ViT paths. These models outperform the state-of-the-art YOLOv9-t in both mean average precision (mAP) and computational cost. On the SOD dataset, the proposed models achieve around 95% mAP50 with giga-floating point operations (GFLOPs) reduced by over 5.0. On the VOC 2012 dataset, they achieve $\geq$ 60.7% mAP50 with GFLOPs reduced by over 5.2.
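For readers unfamiliar with the metric, mAP50 counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below illustrates only that matching criterion, under the assumption that boxes are given as `(x1, y1, x2, y2)` corner coordinates; the function names are hypothetical and this is not the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box when IoU >= threshold
    (0.5 for mAP50)."""
    return iou(pred_box, gt_box) >= threshold
```

A full mAP50 computation additionally ranks predictions by confidence, builds a precision-recall curve per class, and averages the resulting average precision over classes; those steps are omitted here.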
