PointSAM: Pointly-Supervised Segment Anything Model for Remote Sensing Images

Nanqing Liu, Xun Xu, Yongyi Su, Haojie Zhang, Heng-Chao Li

TL;DR

PointSAM addresses the domain gap between remote sensing and natural images by fine-tuning SAM with only point annotations through a self-training framework. It introduces Prototype-based Regularization (PBR) to align target and predicted prototypes via Hungarian matching, and Negative Prompt Calibration (NPC) to refine masks in densely packed RSI scenes; these components are supported by offline FINCH-based prototype generation, a FIFO memory bank for online prototypes, and LoRA-based encoder fine-tuning. The approach yields consistent, state-of-the-art gains across NWPU VHR-10, WHU, and HRSID-inshore datasets, significantly narrowing the gap to fully supervised methods and even enabling a point-to-box pathway for rotated-object detection. The results demonstrate the practical viability of point-based supervision for RSI segmentation and suggest broader applicability to other point-supervised tasks.

Abstract

Segment Anything Model (SAM) is an advanced foundation model for image segmentation that is increasingly being applied to remote sensing images (RSIs). Due to the domain gap between RSIs and natural images, traditional methods typically use SAM as a source pre-trained model and fine-tune it with fully supervised masks. Unlike these methods, our work focuses on fine-tuning SAM using point annotations, which are more convenient to collect but more challenging to learn from. Leveraging SAM's zero-shot capabilities, we adopt a self-training framework that iteratively generates pseudo-labels for training. However, if the pseudo-labels are noisy, there is a risk of error accumulation. To address this issue, we extract target prototypes from the target dataset and use the Hungarian algorithm to match them with prediction prototypes, preventing the model from learning in the wrong direction. Additionally, due to the complex backgrounds and dense distribution of objects in RSIs, using point prompts may result in multiple objects being recognized as one. To solve this problem, we propose a negative prompt calibration method based on the non-overlapping nature of instance masks. In brief, we use the prompts of overlapping masks as corresponding negative signals, resulting in refined masks. Combining the above methods, we propose a novel Pointly-supervised Segment Anything Model named PointSAM. We conduct experiments on RSI datasets, including WHU, HRSID, and NWPU VHR-10, and the results show that our method significantly outperforms direct testing with SAM, SAM2, and other comparison methods. Furthermore, we use PointSAM as a point-to-box converter and achieve encouraging results, suggesting that this method can be extended to other point-supervised tasks. The code is available at https://github.com/Lans1ng/PointSAM.
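
The prototype matching step mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the cosine-distance cost, the function names, and the toy 2-D prototypes are all assumptions made for illustration. The Hungarian algorithm solves the optimal one-to-one assignment problem efficiently; to stay dependency-free, this sketch finds the same optimum by exhaustive search over permutations, which is only practical for a handful of prototypes.

```python
from itertools import permutations
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (assumed cost function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match_prototypes(predicted, targets):
    """Optimal one-to-one assignment of predicted to target prototypes.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    target matched to predicted[i]. Requires len(predicted) <= len(targets).
    Exhaustive search stands in for the Hungarian algorithm here; both return
    a minimum-cost assignment.
    """
    cost = [[cosine_distance(p, t) for t in targets] for p in predicted]
    best, best_cost = None, math.inf
    for perm in permutations(range(len(targets)), len(predicted)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best), best_cost


# Toy 2-D prototypes (hypothetical values, not taken from the paper).
predicted = [[0.9, 0.1], [0.1, 0.9]]
targets = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
assignment, total = match_prototypes(predicted, targets)
print(assignment, round(total, 4))
```

For realistic prototype counts, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the same cost matrix would be the usual choice.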

Paper Structure

This paper contains 36 sections, 14 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: (a) Training pipeline of vanilla SAM. (b) Training pipeline of self-training based pointly-supervised SAM. Sup. means supervise.
  • Figure 2: Segmentation results on the NWPU VHR-10, WHU, and HRSID datasets. (a) Segmentation results using only positive prompts. (b) Segmentation results using both positive and negative prompts.
  • Figure 3: Overall architecture of the proposed PointSAM. (a) Offline prototype generation. First, feature points are obtained from the target domain dataset using the encoder of the frozen Source SAM model, and then clustering is applied to these features to obtain the target domain prototypes. (b) SAM with self-training. The training images undergo strong augmentation and weak augmentation, and are then processed through two encoders with shared weights: the teacher and the student. The original layers of the encoder are frozen, and Low-Rank Adaptation (LoRA) is used for fine-tuning. Calibration refers to Negative Prompt Calibration, which is used to obtain refined masks by adjusting the negative prompts. Matching refers to Hungarian matching, which is used to align predicted prototypes with target prototypes.
  • Figure 4: The process of negative prompt calibration. The positive and negative prompts are represented by red points and green points, respectively. Different prompts fed into SAM generate different initial masks. To refine these masks, an IoU matrix is calculated over all instance pairs. An entry greater than 0 indicates that the two objects overlap and can therefore act as negative constraints for each other. By using the positive prompt of one object as a new negative prompt for the other and feeding it into SAM again, a refined mask is generated. Note that Ground Truth here refers to the mask specified by the prompt for a specific instance, not the mask for all instances.
  • Figure 5: The impact of different thresholds of IoU on the HRSID-inshore dataset with 1-point.
  • ...and 6 more figures
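
The negative prompt calibration described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: masks are represented as sets of (row, col) pixels, the function names are hypothetical, and the final step of re-prompting SAM with the collected negative points is omitted. The default threshold of 0 follows the caption; Figure 5 indicates that other IoU thresholds were also studied.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def calibrate_negative_prompts(masks, points, iou_threshold=0.0):
    """Collect negative prompts for each instance.

    masks[i] is the initial mask predicted from positive point points[i].
    If masks i and j overlap (IoU above the threshold), the positive point of
    j becomes a negative prompt for i, and vice versa.
    """
    negatives = [[] for _ in masks]
    for i in range(len(masks)):
        for j in range(len(masks)):
            if i != j and iou(masks[i], masks[j]) > iou_threshold:
                negatives[i].append(points[j])
    return negatives


# Toy example: instances 0 and 1 share pixel (0, 2); instance 2 is isolated.
masks = [
    {(0, 0), (0, 1), (0, 2)},
    {(0, 2), (0, 3)},
    {(5, 5)},
]
points = [(0, 0), (0, 3), (5, 5)]
negatives = calibrate_negative_prompts(masks, points)
print(negatives)
```

In a full pipeline, each instance would then be segmented again using its own positive point together with the collected negative points, yielding the refined masks.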