SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors

Ruijie Fan, Junyan Ye, Huan Chen, Zilong Huang, Xiaolei Wang, Weijia Li

TL;DR

SatSAM2 targets robust, zero-shot satellite video tracking under occlusion, drift, and domain shift by combining SAM2-based segmentation with a Kalman-filtered motion prior and a motion-constrained state machine. The method has three components: a SAM2 Observer for frame-wise, memory-augmented feature extraction; a Kalman Filter-based Constrained Motion Module (KFCMM) that stabilizes motion estimates during occlusion; and a five-state Motion-Constrained State Machine (MCSM) that adapts the tracking strategy in real time. A large synthetic dataset, MVOT, provides diverse illumination, viewpoint, and occlusion conditions for scalable evaluation and robustness analysis. In experiments, SatSAM2 outperforms both traditional trackers and foundation-model-based approaches, including a 5.84% AUC improvement over state-of-the-art methods on the OOTB dataset, establishing a new baseline for zero-shot remote-sensing video tracking.

Abstract

Existing satellite video tracking methods often struggle with generalization, requiring scenario-specific training to achieve satisfactory performance, and are prone to track loss in the presence of occlusion. To address these challenges, we propose SatSAM2, a zero-shot satellite video tracker built on SAM2, designed to adapt foundation models to the remote sensing domain. SatSAM2 introduces two core modules: a Kalman Filter-based Constrained Motion Module (KFCMM) to exploit temporal motion cues and suppress drift, and a Motion-Constrained State Machine (MCSM) to regulate tracking states based on motion dynamics and reliability. To support large-scale evaluation, we propose MatrixCity Video Object Tracking (MVOT), a synthetic benchmark containing 1,500+ sequences and 157K annotated frames with diverse viewpoints, illumination, and occlusion conditions. Extensive experiments on two satellite tracking benchmarks and MVOT show that SatSAM2 outperforms both traditional and foundation model-based trackers, including SAM2 and its variants. Notably, on the OOTB dataset, SatSAM2 achieves a 5.84% AUC improvement over state-of-the-art methods. Our code and dataset will be publicly released to encourage further research.
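
The role of a Kalman motion prior during occlusion can be illustrated with a minimal sketch. This is not the authors' KFCMM: it is a generic constant-velocity Kalman filter, written under the assumption that the tracker receives one target center per frame, or nothing when the segmentation is unusable. The names `AxisKalman` and `track`, and the noise values `q` and `r`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis (hypothetical helper).

    State is (pos, vel); covariance is the symmetric matrix [[p00, p01], [p01, p11]].
    """
    pos: float
    vel: float = 0.0
    p00: float = 1.0
    p01: float = 0.0
    p11: float = 1.0
    q: float = 0.01  # process noise, assumed value
    r: float = 1.0   # measurement noise, assumed value

    def predict(self, dt: float = 1.0) -> float:
        # State transition F = [[1, dt], [0, 1]]: position advances by velocity.
        self.pos += dt * self.vel
        # Covariance: P <- F P F^T + q * I
        p00 = self.p00 + dt * (2.0 * self.p01 + dt * self.p11) + self.q
        p01 = self.p01 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11
        return self.pos

    def update(self, z: float) -> None:
        # Only position is measured (H = [1, 0]).
        innovation = z - self.pos
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p01 / s  # Kalman gain
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        # Covariance: P <- (I - K H) P
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def track(observations, init_xy):
    """Return one (x, y) estimate per frame.

    `observations` holds (x, y) centers, or None for frames where the
    observation is missing (for example, the target is occluded). On such
    frames the filter coasts on its prediction instead of correcting it.
    """
    kx, ky = AxisKalman(init_xy[0]), AxisKalman(init_xy[1])
    estimates = []
    for obs in observations:
        kx.predict()
        ky.predict()
        if obs is not None:
            kx.update(obs[0])
            ky.update(obs[1])
        estimates.append((kx.pos, ky.pos))
    return estimates
```

For example, feeding `track` ten frames of a target moving right at 2 px per frame followed by three `None` frames yields estimates that keep advancing along the learned velocity through the gap. A full system would also need a rule for deciding when an observation is trustworthy, which is the job the abstract assigns to the MCSM.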

Paper Structure

This paper contains 40 sections, 9 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of satellite video object tracking (SVOT). (a) Challenges in satellite-based tracking tasks. (b) Existing promptable methods either lack motion modeling or fail to account for the complete tracking pipeline. (c) Our approach integrates a Kalman-based motion model with a motion-constrained state machine to enable stable tracking.
  • Figure 2: Overview of the proposed SatSAM2 framework. (a) SAM2 Observer encodes each frame and retrieves candidate masks via memory matching. (b) Kalman Filter-based Constrained Motion Module (KFCMM) estimates target dynamics and provides predictive guidance under occlusion. (c) Motion-Constrained State Machine (MCSM) adaptively switches between tracking modes based on segmentation confidence and motion consistency.
  • Figure 3: Illustration of the MVOT dataset.
  • Figure 4: Qualitative comparison on four remote sensing datasets. Our method achieves the most accurate tracking results and demonstrates strong robustness in recovering targets after occlusion, enabling fast and reliable re-alignment.
  • Figure 5: Ablation comparison between Ours, GT, and variants with different modules removed.
  • ...and 7 more figures