SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation

Aodi Wu, Jianhong Zuo, Zeyuan Zhao, Xubo Luo, Ruisuo Wang, Xue Wan

TL;DR

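SpaceSense-Bench is a large-scale multi-modal benchmark for spacecraft perception encompassing 136~satellite models with approximately 70~GB of data. Benchmarking on it yields two key findings: (i) perceiving small-scale components and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research.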

Abstract

Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet collecting large-scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. We present \textbf{SpaceSense-Bench}, a large-scale multi-modal benchmark for spacecraft perception encompassing 136~satellite models with approximately 70~GB of data. Each frame provides time-synchronized 1024$\times$1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine~5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB--LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i)~perceiving small-scale components (\emph{e.g.}, thrusters and omni-antennas) and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii)~scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at https://github.com/wuaodi/SpaceSense-Bench.
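The orientation estimation task above compares a predicted attitude against the 6-DoF pose ground truth. The abstract does not state which error metric the benchmark uses, so the following is only a minimal sketch of one common choice, the geodesic angle between two rotations. The function name `rotation_error_deg` and the `(w, x, y, z)` quaternion ordering are assumptions made for illustration, not part of the SpaceSense-Bench toolkit.

```python
import math


def rotation_error_deg(q_pred, q_gt):
    """Geodesic angle (degrees) between two orientations.

    Both inputs are quaternions in (w, x, y, z) order; this ordering is an
    assumption for the sketch, not the dataset's documented convention.
    """
    def normalize(q):
        norm = math.sqrt(sum(c * c for c in q))
        if norm == 0.0:
            raise ValueError("zero-norm quaternion")
        return [c / norm for c in q]

    p = normalize(q_pred)
    g = normalize(q_gt)
    # q and -q encode the same rotation, so take the absolute dot product.
    dot = abs(sum(a * b for a, b in zip(p, g)))
    # Clamp to guard against rounding slightly above 1.0.
    dot = min(1.0, dot)
    return math.degrees(2.0 * math.acos(dot))


if __name__ == "__main__":
    identity = (1.0, 0.0, 0.0, 0.0)
    # 90-degree rotation about the z axis.
    quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
    print(rotation_error_deg(identity, quarter_turn_z))
```

Taking the absolute value of the dot product keeps the error in the range of 0 to 180 degrees regardless of which of the two equivalent quaternion signs a model outputs.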

Paper Structure (17 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of SpaceSense-Bench. Top: 136 diverse satellite models rendered in a high-fidelity space environment built with Unreal Engine 5. Bottom left: the on-orbit servicing scenario in which a servicing spacecraft perceives a non-cooperative target. Bottom middle: time-synchronized multi-modal data (RGB image, LiDAR point cloud, and depth map) provided for every frame. Bottom right: the six downstream tasks supported by the benchmark, including 2D/3D part segmentation, object detection, 6-DoF pose estimation, depth estimation, multi-modal fusion, and visual navigation.
  • Figure 2: Overall data collection pipeline of SpaceSense-Bench. The pipeline consists of four stages: (1) 3D asset library construction and part decomposition, (2) high-fidelity space scene setup, (3) adaptive trajectory planning and multi-sensor synchronized capture, and (4) automated ground truth generation, quality control, and mainstream format export.
  • Figure 3: Multi-modal data and ground truth examples from SpaceSense-Bench. Each column shows one satellite. From top to bottom: RGB image with 6D pose axes overlay, seven-class semantic segmentation mask, LiDAR point cloud with per-point semantic labels, and colorized depth map.
  • Figure 4: Dataset statistics. (a) Per-class foreground pixel ratio showing a long-tail distribution. (b) LiDAR point count versus target distance. (c) Satellite maximum dimension distribution (log scale). (d) Frame count per trajectory type for 22 approach and 5 orbit trajectories.
  • Figure 5: Effect of training set size on zero-shot generalization. mIoU and mAcc of PMFNet on the 14-satellite test set as the number of training satellites increases from $\sim$9 to 117.
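Figure 5 reports mIoU and mAcc for part segmentation. As a rough illustration of how these two scores are usually derived from per-pixel or per-point labels, the sketch below builds a confusion matrix and averages per-class IoU and per-class accuracy. The function name `miou_macc`, the integer class indices, and the choice to skip classes that never occur are assumptions for this example; the paper's exact evaluation protocol may differ.

```python
def miou_macc(pred, gt, num_classes=7):
    """Return (mIoU, mAcc) for two equal-length sequences of class indices.

    num_classes defaults to 7 to match the seven part classes in the dataset;
    the mapping from index to part name is not specified here.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    # confusion[g][p] counts samples with ground truth g predicted as p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        confusion[g][p] += 1

    ious, accs = [], []
    for c in range(num_classes):
        tp = confusion[c][c]
        gt_count = sum(confusion[c])
        pred_count = sum(confusion[r][c] for r in range(num_classes))
        union = gt_count + pred_count - tp
        # Assumption: classes absent from both pred and gt are skipped for IoU,
        # and classes absent from gt are skipped for accuracy.
        if union > 0:
            ious.append(tp / union)
        if gt_count > 0:
            accs.append(tp / gt_count)

    if not ious or not accs:
        raise ValueError("no valid classes to average over")
    return sum(ious) / len(ious), sum(accs) / len(accs)


if __name__ == "__main__":
    labels_gt = [0, 0, 1, 1]
    labels_pred = [0, 1, 1, 1]
    print(miou_macc(labels_pred, labels_gt, num_classes=2))
```

Because both scores average over classes rather than over pixels, rare small parts such as thrusters weigh as much as large ones, which is why a long-tail class distribution tends to pull mIoU down.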