FastPose-ViT: A Vision Transformer for Real-Time Spacecraft Pose Estimation

Pierre Ancey, Andrew Price, Saqib Javed, Mathieu Salzmann

TL;DR

The paper tackles real-time, monocular 6DoF spacecraft pose estimation on edge devices by introducing FastPose-ViT, a Vision Transformer-based direct regressor that operates on cropped object images. It couples a geometric reformulation with an apparent-rotation correction to recover full-frame translation and orientation from crop-space predictions. The method outperforms other non-PnP approaches, is competitive with state-of-the-art PnP-based techniques on SPEED, and runs in real time on Jetson hardware. Key contributions include a crop-aware translation/rotation target design (Uz, Ux, Uy), a closed-form rotation correction, extensive data augmentations, and an end-to-end deployment pipeline featuring FP16 TensorRT quantization. Experiments on SPEED and SPEED+ demonstrate strong performance, and ablations validate the importance of pretraining, cropping, and the proposed geometric targets, while revealing remaining challenges under large domain gaps.

Abstract

Estimating the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image is critical for autonomous operations like in-orbit servicing and space debris removal. Existing state-of-the-art methods often rely on iterative Perspective-n-Point (PnP)-based algorithms, which are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. To overcome these limitations, we propose FastPose-ViT, a Vision Transformer (ViT)-based architecture that directly regresses the 6DoF pose. Our approach processes cropped images from object bounding boxes and introduces a novel mathematical formalism to map these localized predictions back to the full-image scale. This formalism is derived from the principles of projective geometry and the concept of "apparent rotation", where the model predicts an apparent rotation matrix that is then corrected to find the true orientation. We demonstrate that our method outperforms other non-PnP strategies and achieves performance competitive with state-of-the-art PnP-based techniques on the SPEED dataset. Furthermore, we validate our model's suitability for real-world space missions by quantizing it and deploying it on power-constrained edge hardware. On the NVIDIA Jetson Orin Nano, our end-to-end pipeline achieves a latency of ~75 ms per frame under sequential execution, and a non-blocking throughput of up to 33 FPS when stages are scheduled concurrently.
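The two timing figures in the abstract measure different things: the ~75 ms latency is the time one frame takes to pass through all stages in sequence, whereas the 33 FPS throughput is the rate at which frames complete when the stages overlap. The following Python sketch illustrates that relationship under a simple pipelining model in which throughput is bounded by the slowest stage. The function names and the three-stage split are hypothetical and are not taken from the paper; only the 75 ms and 33 FPS figures come from the abstract.

```python
def sequential_fps(latency_ms: float) -> float:
    """Frame rate when each frame must finish before the next one starts."""
    return 1000.0 / latency_ms


def bottleneck_stage_ms(throughput_fps: float) -> float:
    """Duration of the slowest stage implied by a pipelined throughput."""
    return 1000.0 / throughput_fps


def pipelined_fps(stage_ms: list[float]) -> float:
    """Frame rate of an ideal pipeline: limited by its slowest stage."""
    return 1000.0 / max(stage_ms)


# Figures reported in the abstract.
print(round(sequential_fps(75.0), 1))       # 13.3 FPS under sequential execution
print(round(bottleneck_stage_ms(33.0), 1))  # 30.3 ms for the slowest stage

# HYPOTHETICAL stage split (not from the paper) that sums to 75 ms.
stages = [30.0, 30.0, 15.0]
print(round(pipelined_fps(stages), 1))      # 33.3 FPS
```

Under this model, a 75 ms end-to-end latency and a 33 FPS throughput are consistent as long as no single stage takes longer than roughly 30 ms.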

Paper Structure

This paper contains 31 sections, 21 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Overview of the FastPose-ViT pipeline. A bounding-box detector first extracts the spacecraft from the full image, producing a cropped input for the pose network. Then, (1) a ViT-based architecture regresses intermediate pose parameters from the crop; (2) geometric recovery formulas convert these into the full-frame translation and rotation; (3) the trained model is exported, quantized, and optimized for real-time deployment on an NVIDIA Jetson Orin Nano.
  • Figure 2: Apparent Rotation ($R'$). An object's perceived orientation changes with its position in the image due to the camera perspective. The apparent rotation $R'$ is the true orientation a centered object must have to best represent the off-center object.
  • Figure 3: Qualitative Results on SPEED and SPEED+. We visualize our best model’s performance on the test sets, showing examples of the model's best, average, and worst predictions. Ground-truth (GT) axes are in solid primary colors, while predicted (Pred) axes are shown as lighter, dashed lines. Errors below each image are reported as Translation Error [m] $\vert$ Rotation Error [deg].
  • Figure 4: Impact of bounding box quality on pose estimation. Left: a poor bounding box prediction (red) compared to the ground truth (green) leading to erroneous pose regression. Right: a typical bounding box prediction (red) closely aligned with the ground truth (green) resulting in stable pose estimates.
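
The apparent-rotation idea in Figure 2 can be illustrated with a short Python sketch. It uses a common formulation, which may differ from the paper's closed-form correction: the true orientation R is taken to be R_c · R', where R_c is the smallest rotation that turns the camera's optical axis onto the viewing ray through the bounding-box centre. The function names, the pinhole intrinsics (fx, fy, cx, cy), and the example values are all assumptions made for illustration.

```python
import math


def ray_alignment_rotation(u, v, fx, fy, cx, cy):
    """Rotation R_c mapping the optical axis (0, 0, 1) onto the unit ray
    through pixel (u, v) of a pinhole camera (assumed model)."""
    x, y = (u - cx) / fx, (v - cy) / fy
    n = math.sqrt(x * x + y * y + 1.0)
    dx, dy, dz = x / n, y / n, 1.0 / n
    k = 1.0 / (1.0 + dz)  # Rodrigues term for aligning two unit vectors
    return [
        [1.0 - k * dx * dx, -k * dx * dy, dx],
        [-k * dx * dy, 1.0 - k * dy * dy, dy],
        [-dx, -dy, dz],
    ]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def correct_apparent_rotation(r_apparent, u, v, fx, fy, cx, cy):
    """Recover the full-frame rotation as R = R_c @ R' (assumed convention)."""
    return matmul(ray_alignment_rotation(u, v, fx, fy, cx, cy), r_apparent)


# HYPOTHETICAL intrinsics and bounding-box centre, for illustration only.
fx = fy = 1000.0
cx, cy = 960.0, 600.0
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A centred object needs no correction: R equals R'.
print(correct_apparent_rotation(identity, cx, cy, fx, fy, cx, cy))

# An off-centre object with the same apparent rotation has a different R.
print(correct_apparent_rotation(identity, 1500.0, 200.0, fx, fy, cx, cy))
```

In this sketch the correction depends only on where the crop sits in the image, so a network can regress R' from the crop alone and leave the position-dependent part to geometry.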