GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation

Weiming Zhang, Yexin Liu, Xu Zheng, Lin Wang

TL;DR

GoodSAM tackles transferring segmentation knowledge from the Segment Anything Model to a compact panoramic semantic segmentation model without labeled data. It introduces a teacher assistant to provide semantic cues and two modules, Distortion-Aware Rectification (DAR) and Multi-level Knowledge Adaptation (MKA), to produce distortion-aware ensemble logits and multi-scale transfer to a lightweight student. The approach uses an overlapping sliding window scheme to handle ERP distortion, a cross-task fusion to combine instance masks and semantic labels, and multi-level losses to align the student with TA guidance. Empirical results on DensePASS and WildPASS show state-of-the-art performance with compact parameter budgets, including a 3.7M-parameter tiny variant, and substantial mIoU gains over prior UDA methods.
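The overlapping sliding-window idea mentioned above can be illustrated with a minimal sketch. It assumes a panorama is split along its width into fixed-size windows that overlap, and that per-window logits are merged by simple averaging in the overlap regions. The function names, window size, stride, and the averaging rule are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def sliding_windows(width, window, stride):
    """Return (start, end) column spans that cover [0, width) with overlap.

    Hypothetical helper: the window/stride values are assumptions.
    """
    if window > width:
        raise ValueError("window must not exceed the image width")
    starts = list(range(0, width - window + 1, stride))
    # Add a final window flush with the right edge if columns are left over.
    if starts[-1] + window < width:
        starts.append(width - window)
    return [(s, s + window) for s in starts]


def merge_window_logits(window_logits, spans, width):
    """Average per-window logits of shape (C, H, w) into one (C, H, W) map.

    Averaging in overlaps is an assumption made for this sketch.
    """
    channels, height, _ = window_logits[0].shape
    total = np.zeros((channels, height, width))
    count = np.zeros((1, 1, width))
    for logits, (start, end) in zip(window_logits, spans):
        total[:, :, start:end] += logits
        count[:, :, start:end] += 1
    return total / count


# Usage: three half-overlapping windows over a 2048-column panorama.
spans = sliding_windows(width=2048, window=1024, stride=512)
dummy_logits = [np.ones((19, 4, 1024)) for _ in spans]
merged = merge_window_logits(dummy_logits, spans, width=2048)
```

Overlap matters here because the same scene content is predicted in two neighboring windows, which is what makes a prediction-level consistency constraint between windows possible.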

Abstract

This paper tackles a novel yet challenging problem: how to transfer knowledge from the emerging Segment Anything Model (SAM) -- which reveals impressive zero-shot instance segmentation capacity -- to learn a compact panoramic semantic segmentation model, i.e., student, without requiring any labeled data. This poses considerable challenges due to SAM's inability to provide semantic labels and the large capacity gap between SAM and the student. To this end, we propose a novel framework, called GoodSAM, that introduces a teacher assistant (TA) to provide semantic information, integrated with SAM to generate ensemble logits to achieve knowledge transfer. Specifically, we propose a Distortion-Aware Rectification (DAR) module that first addresses the distortion problem of panoramic images by imposing prediction-level consistency and boundary enhancement. This subtly enhances TA's prediction capacity on panoramic images. DAR then incorporates a cross-task complementary fusion block to adaptively merge the predictions of SAM and TA to obtain more reliable ensemble logits. Moreover, we introduce a Multi-level Knowledge Adaptation (MKA) module to efficiently transfer the multi-level feature knowledge from TA and ensemble logits to learn a compact student model. Extensive experiments on two benchmarks show that our GoodSAM achieves a remarkable +3.75% mIoU improvement over the state-of-the-art (SOTA) domain adaptation methods. Also, our most lightweight model achieves comparable performance to the SOTA methods with only 3.7M parameters.
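To make the idea of combining class-agnostic instance masks with semantic predictions concrete, here is a minimal sketch. It is not the paper's cross-task complementary fusion block, which is described as adaptive; it swaps in a plain majority vote, where every pixel inside a mask takes the most frequent class that the semantic model predicts within that mask. The function name and the input shapes are assumptions made for illustration.

```python
import numpy as np


def fuse_masks_with_semantics(semantic_logits, instance_masks):
    """Assign each instance mask the majority semantic class inside it.

    semantic_logits: float array of shape (C, H, W) from a semantic model.
    instance_masks:  list of boolean arrays of shape (H, W), class-agnostic.
    Returns an integer label map of shape (H, W).
    Majority voting is a stand-in chosen for this sketch.
    """
    num_classes = semantic_logits.shape[0]
    labels = semantic_logits.argmax(axis=0)
    fused = labels.copy()
    for mask in instance_masks:
        if not mask.any():
            continue  # skip empty masks
        # Vote using the original per-pixel labels, not the fused ones.
        votes = np.bincount(labels[mask], minlength=num_classes)
        fused[mask] = votes.argmax()
    return fused


# Usage: a 2x3 map with two classes and one mask covering the top row.
logits = np.zeros((2, 2, 3))
logits[1] = np.array([[0, 0, 1], [0, 1, 1]])  # class 1 wins where value is 1
top_row = np.array([[True, True, True], [False, False, False]])
fused = fuse_masks_with_semantics(logits, [top_row])
```

The sketch shows why the two sources are complementary: the masks supply clean region boundaries but no class names, while the semantic logits supply class names with noisier boundaries.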

Paper Structure (13 sections, 9 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: (a) Illustration of our GoodSAM, leveraging instance masks and boundary information provided by SAM, coupled with segmentation logits from the teacher assistant, to obtain reliable ensemble logits for knowledge adaptation to our student. (b) Our GoodSAM outperforms SOTA methods [zhang2022bending, zheng2023both, zheng2023look] across various model parameter ranges. Notably, GoodSAM-M achieves comparable performance to the SOTA methods with only 3.7M parameters.
  • Figure 2: Overview of GoodSAM framework, consisting of three models: SAM, teacher assistant, and student. Our method has two main technical components: the Distortion-Aware Rectification (DAR) module and the Multi-level Knowledge Adaptation (MKA) module.
  • Figure 3: Overview of the proposed boundary enhancement block. (a) shows the case where the pixels at the same position in all three images are boundary pixels; (b) shows the case where the pixels at the same position are not on the boundary. Panel (a) also illustrates the optimization of the boundary enhancement loss for $B_{TA}^i$ and $B_{TA}^j$.
  • Figure 4: Example visualization results from the DensePASS test set: (a) Input panorama image, (b) Segformer-B5 [xie2021segformer] without sliding window sampling, (c) DPPASS-S [zheng2023both], (d) DATR-S [zheng2023look], (e) GoodSAM-S, (f) Ground truth.
  • Figure 5: Effectiveness of the CTCF block.
  • ...and 2 more figures