OBSeg: Accurate and Fast Instance Segmentation Framework Using Segmentation Foundation Models with Oriented Bounding Box Prompts

Zhen Zhou, Junfeng Fan, Yunkai Ma, Sihan Zhao, Fengshui Jing, Min Tan

TL;DR

This work tackles accurate instance segmentation in remote sensing images by moving from the "segmentation within bounding box" paradigm to using oriented bounding boxes (OBBs) as prompts for segmentation foundation models. The authors introduce OBSeg, a framework with four main parts (an OBB detection module, an image encoder, an OBB prompt encoder, and a mask decoder), together with a Gaussian smoothing-based knowledge distillation method that makes the model more lightweight. Because OBBs serve only as prompts, segmentation depends less on OBB detection quality, and OBSeg reports state-of-the-art accuracy on iSAID, NWPU VHR-10, and PSeg-SSDD with competitive inference speed. The approach performs well across these datasets and targets large-scale, oriented-object segmentation in aerial imagery.

Abstract

Instance segmentation in remote sensing images is a long-standing challenge. Since horizontal bounding boxes (HBBs) introduce many interference objects, oriented bounding boxes (OBBs) are usually used for instance identification. However, because they follow the "segmentation within bounding box" paradigm, current instance segmentation methods using OBBs are overly dependent on bounding box detection performance. To tackle this problem, this paper proposes OBSeg, an accurate and fast instance segmentation framework using OBBs. OBSeg is based on box prompt-based segmentation foundation models (BSMs), e.g., the Segment Anything Model. Specifically, OBSeg first detects OBBs to distinguish instances and provide coarse localization information. Then, it predicts OBB prompt-related masks for fine segmentation. Since OBBs only serve as prompts, OBSeg alleviates the over-dependence on bounding box detection performance found in current OBB-based instance segmentation methods. Thanks to OBB prompts, OBSeg also outperforms other current BSM-based methods that use HBBs. In addition, to enable BSMs to handle OBB prompts, we propose a novel OBB prompt encoder. To make OBSeg more lightweight and further improve the performance of lightweight distilled BSMs, a Gaussian smoothing-based knowledge distillation method is introduced. Experiments demonstrate that OBSeg outperforms current instance segmentation methods on multiple datasets in terms of instance segmentation accuracy and has competitive inference speed. The code is available at https://github.com/zhen6618/OBBInstanceSegmentation.
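
To make the abstract's point about HBBs concrete, the following minimal sketch compares the area of an OBB with the area of the smallest HBB that encloses it. It is an illustration only and is not taken from the OBSeg code. The function names are hypothetical, and the angle convention (radians, rotation about the box center) is an assumption.

```python
import math


def obb_corners(x, y, w, h, theta):
    """Corners of an OBB with center (x, y), size (w, h), angle theta.

    Assumption: theta is in radians and rotates the box about its center.
    """
    c, s = math.cos(theta), math.sin(theta)
    half_offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in half_offsets]


def enclosing_hbb(corners):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around the corners."""
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return min(xs), min(ys), max(xs), max(ys)


def hbb_to_obb_area_ratio(x, y, w, h, theta):
    """How many times larger the enclosing HBB is than the OBB itself."""
    x_min, y_min, x_max, y_max = enclosing_hbb(obb_corners(x, y, w, h, theta))
    return ((x_max - x_min) * (y_max - y_min)) / (w * h)


if __name__ == "__main__":
    # A long, thin object (for example a ship) rotated by 45 degrees.
    ratio = hbb_to_obb_area_ratio(0.0, 0.0, 100.0, 20.0, math.pi / 4)
    print(f"HBB area / OBB area = {ratio:.2f}")  # about 3.6
```

For an elongated object at 45 degrees, the enclosing HBB covers several times the area of the OBB. That extra area is where neighboring objects can fall inside the box, which is the interference the abstract refers to.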

Paper Structure

This paper contains 29 sections, 17 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Instance segmentation in remote sensing images. (a) HBBs introduce many interference objects. (b) The "segmentation within bounding box" paradigm restricts segmentation mainly to the detected OBB, making segmentation performance overly dependent on OBB detection performance: when the OBB detection is inaccurate, the mask is also affected. (c) The proposed OBSeg uses the OBB only as a prompt to guide object segmentation, so the result is less dependent on OBB detection performance: even when the OBB detection is inaccurate, the mask can still be segmented accurately.
  • Figure 2: Architecture of the proposed OBSeg. It is mainly composed of four parts: an OBB detection module, an image encoder, an OBB prompt encoder, and a mask decoder. OBSeg first detects OBBs to distinguish instances, identify classes, and provide coarse localization information. Then, the mask decoder utilizes the image embeddings generated by the image encoder and the OBB prompt embeddings generated by the OBB prompt encoder to generate segmentation masks. In addition, Gaussian smoothing-based knowledge distillation is performed on the OBB prompt encoder and the mask decoder to make OBSeg more lightweight.
  • Figure 3: Architecture of the proposed OBB prompt encoder. The input is an OBB ($x, y, w, h, \theta$), where $(x, y)$, $w$, $h$ and $\theta$ represent the center point, width, height and orientation, respectively.
  • Figure 4: The process of knowledge distillation for the OBB prompt encoder and mask decoder. "TE", "BE" and "OE" represent encoded feature embeddings with respect to the top-left point, bottom-right point and orientation of an OBB, respectively. "GS" stands for Gaussian smoothing.
  • Figure 5: Some visualization results of OBSeg's predictions on the iSAID dataset. Segmentation masks of different colors represent different instances.
  • ...and 1 more figure
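
The captions of Figures 3 and 4 mention two views of an OBB: the center form ($x, y, w, h, \theta$) and a top-left point, a bottom-right point, and an orientation. The sketch below shows one way to convert between them. It assumes that the two points are the corners of the box before it is rotated about its center, and that image coordinates are used (y grows downward). The captions do not state either convention, and the function names are hypothetical, so this should not be read as the paper's encoder input format.

```python
def obb_to_point_form(x, y, w, h, theta):
    """Center form -> (top_left, bottom_right, theta).

    Assumption: the points describe the box before rotation about its center,
    in image coordinates where the top-left corner has the smallest x and y.
    """
    top_left = (x - w / 2, y - h / 2)
    bottom_right = (x + w / 2, y + h / 2)
    return top_left, bottom_right, theta


def point_form_to_obb(top_left, bottom_right, theta):
    """Inverse of obb_to_point_form under the same assumptions."""
    (x0, y0), (x1, y1) = top_left, bottom_right
    return (x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0, theta


if __name__ == "__main__":
    tl, br, angle = obb_to_point_form(50.0, 40.0, 20.0, 10.0, 0.3)
    print(tl, br, angle)
    print(point_form_to_obb(tl, br, angle))
```

With theta set to 0, the point form reduces to the usual two-corner HBB box prompt, which is one reason this decomposition is a natural fit for box prompt-based models.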