Open Panoramic Segmentation

Junwei Zheng, Ruiping Liu, Yufan Chen, Kunyu Peng, Chengzhi Wu, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

TL;DR

This work defines Open Panoramic Segmentation (OPS), a task in which models are trained on FoV-restricted pinhole images and evaluated zero-shot, with an open vocabulary, on 360° panoramas. It introduces OOOPS, a model that couples a frozen CLIP backbone with a Deformable Adapter Network (DAN), whose Deformable Adapter Operator (DAO) handles panoramic distortions, and it adds Random Equirectangular Projection (RERP), a data augmentation that simulates those distortions during training. Experiments on WildPASS, Stanford2D3D, and Matterport3D show that OOOPS with RERP outperforms other open-vocabulary segmentation methods by up to ~2.4 percentage points in mIoU, while remaining competitive with some close-vocabulary panoramic segmentation methods. The approach targets practical, distortion-aware, zero-shot scene understanding in panoramic imagery, and the code is publicly available for replication and extension.
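To make the open-vocabulary idea concrete, here is a minimal sketch of how a segmentation model can label pixels by comparing per-pixel image embeddings with text embeddings of arbitrary class names. It is a generic illustration, not the OOOPS architecture: the function `open_vocab_segment` and the random arrays standing in for CLIP features are assumptions made for this example.

```python
import numpy as np

def open_vocab_segment(pixel_emb, text_emb):
    """Assign each pixel the class whose text embedding is most similar.

    pixel_emb: (H, W, D) per-pixel embeddings (stand-in for image features).
    text_emb:  (C, D) one embedding per class name (stand-in for text features).
    Returns an (H, W) array of class indices.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    similarity = p @ t.T  # shape (H, W, C)
    return similarity.argmax(axis=-1)

# Toy data: three hypothetical classes, every pixel lies close to class 2.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 8))
pixel_emb = text_emb[2] + 0.01 * rng.normal(size=(4, 6, 8))
labels = open_vocab_segment(pixel_emb, text_emb)
```

Because the class list enters only through `text_emb`, new categories can be added at inference time by embedding new class names, without retraining the classifier head.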

Abstract

Panoramic images, capturing a 360° field of view (FoV), encompass omnidirectional spatial information crucial for scene understanding. However, it is not only costly to obtain training-sufficient dense-annotated panoramas but also application-restricted when training models in a close-vocabulary setting. To tackle this problem, in this work, we define a new task termed Open Panoramic Segmentation (OPS), where models are trained with FoV-restricted pinhole images in the source domain in an open-vocabulary setting while evaluated with FoV-open panoramic images in the target domain, enabling the zero-shot open panoramic semantic segmentation ability of models. Moreover, we propose a model named OOOPS with a Deformable Adapter Network (DAN), which significantly improves zero-shot panoramic semantic segmentation performance. To further enhance the distortion-aware modeling ability from the pinhole source domain, we propose a novel data augmentation method called Random Equirectangular Projection (RERP) which is specifically designed to address object deformations in advance. Surpassing other state-of-the-art open-vocabulary semantic segmentation approaches, a remarkable performance boost on three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, proves the effectiveness of our proposed OOOPS model with RERP on the OPS task, especially +2.2% on outdoor WildPASS and +2.4% mIoU on indoor Stanford2D3D. The source code is publicly available at https://junweizheng93.github.io/publications/OPS/OPS.html.
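The reported gains are stated in mIoU (mean intersection over union). As a reminder of what that metric measures, the following is a minimal sketch for a single label map; the helper name `mean_iou` and the toy arrays are assumptions for illustration, and benchmark implementations typically accumulate intersections and unions over the whole dataset rather than per image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average the per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

Under this definition, a gain such as +2.4% mIoU corresponds to an absolute increase of 2.4 percentage points in the averaged score.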

Paper Structure

This paper contains 25 sections, 5 equations, 12 figures, and 12 tables.

Figures (12)

  • Figure 1: (a) The challenge of existing state-of-the-art segmentation models. (b) The limitation of categories in traditional close-vocabulary panoramic segmentation tasks. (c) Our newly defined Open Panoramic Segmentation (OPS) task aims at tackling the above challenges. OPS consists of three important elements: Open the FoV targeted at the challenge of 360° FoV, Open the Vocabulary targeted at the drawback of close-vocabulary panoramic segmentation and Open the Domain targeted at the challenge of scarcity of panoramic labels.
  • Figure 2: Overview of the OOOPS model architecture. It consists of a frozen CLIP model and a Deformable Adapter Network (DAN) which includes Transformer Layers and the proposed DAO.
  • Figure 3: Salient map generation in DAO.
  • Figure 4: (a) Visualization of ERP on a panoramic image and (b) Our proposed RERP on pinhole images.
  • Figure 5: (a) Comparison on the WildPASS dataset and (b) Visualization of the prediction from OOOPS in close- and open-vocabulary settings.
  • ...and 7 more figures
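Figure 4 concerns equirectangular projection (ERP) applied to pinhole images. The sketch below shows only the geometric idea, under stated assumptions: it resamples a pinhole image onto a longitude/latitude grid that covers the same field of view, using nearest-neighbour sampling. It is not the authors' RERP implementation, it does not reproduce the random component, and the function name `pinhole_to_equirect` and the parameter `hfov_deg` are hypothetical.

```python
import numpy as np

def pinhole_to_equirect(img, hfov_deg=90.0):
    """Resample a pinhole image onto an equirectangular (lon/lat) grid."""
    h, w = img.shape[:2]
    half_lon = np.radians(hfov_deg) / 2
    # Focal length in pixels, measured between pixel centres.
    f = ((w - 1) / 2) / np.tan(half_lon)
    # Vertical half field of view along the central image column.
    half_lat = np.arctan(((h - 1) / 2) / f)

    lon, lat = np.meshgrid(np.linspace(-half_lon, half_lon, w),
                           np.linspace(-half_lat, half_lat, h))

    # A ray at (lon, lat) has direction
    #   x = cos(lat) sin(lon), y = sin(lat), z = cos(lat) cos(lon).
    # Perspective projection gives u = f * x / z and v = f * y / z.
    u = f * np.tan(lon) + (w - 1) / 2
    v = f * np.tan(lat) / np.cos(lon) + (h - 1) / 2

    ui = np.rint(u).astype(int)
    vi = np.rint(v).astype(int)
    valid = (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)

    # Pixels whose rays fall outside the pinhole image stay zero (black).
    out = np.zeros_like(img)
    out[valid] = img[vi[valid], ui[valid]]
    return out

# Toy input: a white 33x65 RGB image with an assumed 90 degree horizontal FoV.
img = np.full((33, 65, 3), 255, dtype=np.uint8)
warped = pinhole_to_equirect(img)
```

The image centre is left unchanged while the corners map outside the source image and remain black, which gives the curved boundary characteristic of ERP-style distortion.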