OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, Xuming Hu

TL;DR

OmniSAM introduces a memory-enabled, patch-sequence framework to adapt SAM2 for panoramic semantic segmentation under unsupervised domain adaptation. By splitting panoramas into FoV-overlapping patches, leveraging SAM2’s memory for cross-patch consistency, and applying FoV-based prototypical adaptation plus dynamic pseudo-label updates, it achieves substantial gains over prior methods in pinhole-to-panoramic and synthetic-to-real benchmarks. The approach yields state-of-the-art results across indoor and outdoor scenarios with compact trainable parameters, underscoring practical viability for panoramic scene understanding. These contributions demonstrate a scalable path to transferring powerful foundation models to distorted panoramic data while preserving semantic precision.

Abstract

Segment Anything Model 2 (SAM2) has emerged as a strong base model for various pinhole-image segmentation tasks. However, when it is applied to the $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic ($180^\circ \times 360^\circ$) images poses unique challenges. Two major concerns arise: 1) the inevitable distortion and object deformation caused by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding in the original SAM2. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 to panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences, in a manner similar to video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reuses the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving the model's generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8 and 62.46% (+6.58%) on CS13-to-DP13.
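To make the patch-sequence idea concrete, here is a minimal sketch of how a panorama could be cut into an ordered sequence of horizontally overlapping patches, which is the kind of input a video-style memory mechanism would consume. It is an illustration only, not the authors' implementation: the function name `split_into_patch_sequence` and the parameters `patch_width` and `stride` are hypothetical, and the abstract does not state the patch size, the overlap, or whether the window wraps around the $360^\circ$ seam.

```python
import numpy as np


def split_into_patch_sequence(pano, patch_width, stride):
    """Slide a window left to right across a panorama (H x W or H x W x C).

    Hypothetical helper: returns the patches in scan order together with
    the column at which each patch starts. Choosing stride < patch_width
    makes consecutive patches share (patch_width - stride) columns.
    """
    width = pano.shape[1]
    if patch_width > width:
        raise ValueError("patch_width must not exceed the panorama width")
    if not 0 < stride <= patch_width:
        raise ValueError("stride must be in (0, patch_width]")

    starts = list(range(0, width - patch_width + 1, stride))
    # Add a final window flush with the right edge if the stride skipped it.
    if starts[-1] != width - patch_width:
        starts.append(width - patch_width)

    patches = [pano[:, s:s + patch_width] for s in starts]
    return patches, starts
```

With a stride of half the patch width, each patch shares half of its columns with its predecessor, so a model that carries memory from one patch to the next sees the overlapping region twice. This sketch does not wrap around the left and right image borders; a wrap-around variant would be a separate design choice.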

Paper Structure

This paper contains 23 sections, 8 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: An overview of the OmniSAM framework. First, OmniSAM is trained on the source domain to obtain the source model. Then, the FoV-based prototypical adaptation module is employed for cross-domain feature alignment.
  • Figure 2: Dynamic pseudo-label updating mechanism.
  • Figure 3: FoV-based Prototypical Adaptation.
  • Figure 4: Visualizations on the DensePASS dataset.
  • Figure 5: Visualizations on the Stanford2D3D-Panoramic dataset.
  • ...and 5 more figures