
Multi-source Domain Adaptation for Panoramic Semantic Segmentation

Jing Jiang, Sicheng Zhao, Jiankun Zhu, Wenbo Tang, Zhaopan Xu, Jidong Yang, Guoping Liu, Tengfei Xing, Pengfei Xu, Hongxun Yao

TL;DR

The paper introduces a new task, MSDA4PASS, which uses labeled real pinhole images and labeled synthetic panoramic images together to improve semantic segmentation on unlabeled real panoramas. To solve it, the paper proposes DTA4PASS, which has two components: Unpaired Semantic Morphing (USM), which bridges the distortion gap by learning a deformation of pinhole images without paired panoramic views, and Distortion Gating Alignment (DGA), which bridges the texture gap by gating pin-like and pan-like features and aligning them under uncertainty guidance. The approach achieves state-of-the-art results on outdoor and indoor panoramic benchmarks, outperforming single-source, multi-source, and panoramic domain adaptation baselines, and ablation studies support the contribution of both USM and DGA. Because pinhole and synthetic panoramic data are readily available, the method reduces reliance on costly real panoramic annotations for applications such as autonomous driving and robotics.

Abstract

Unsupervised domain adaptation methods for panoramic semantic segmentation utilize real pinhole images or low-cost synthetic panoramic images to transfer segmentation models to real panoramic images. However, these methods struggle to understand the panoramic structure using only real pinhole images and lack real-world scene perception with only synthetic panoramic images. Therefore, in this paper, we propose a new task, Multi-source Domain Adaptation for Panoramic Semantic Segmentation (MSDA4PASS), which leverages both real pinhole and synthetic panoramic images to improve segmentation on unlabeled real panoramic images. There are two key issues in the MSDA4PASS task: (1) distortion gaps between the pinhole and panoramic domains -- panoramic images exhibit global and local distortions absent in pinhole images; (2) texture gaps between the source and target domains -- scenes and styles differ across domains. To address these two issues, we propose a novel framework, Deformation Transform Aligner for Panoramic Semantic Segmentation (DTA4PASS), which converts all pinhole images in the source domains into distorted images and aligns the source distorted and panoramic images with the target panoramic images. Specifically, DTA4PASS consists of two main components: Unpaired Semantic Morphing (USM) and Distortion Gating Alignment (DGA). First, in USM, the Dual-view Discriminator (DvD) assists in training the diffeomorphic deformation network at the image and pixel level, enabling the effective deformation transformation of pinhole images without paired panoramic views, alleviating distortion gaps. Second, DGA assigns pinhole-like (pin-like) and panoramic-like (pan-like) features to each image by gating, and aligns these two features through uncertainty estimation, reducing texture gaps.

Paper Structure

This paper contains 22 sections, 14 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of our method with other panoramic domain adaptation methods. The previous methods utilize only real pinhole or synthetic panoramic images. Meanwhile, they either (a) directly perform alignment, or (b) slice panoramic images into pinhole-like patches before alignment. (c) Our method converts all source pinhole images into distorted images and aligns source distorted images and source panoramic images with target panoramic images, perceiving both real-world scenes and panoramic structures.
  • Figure 2: An overall illustration of the proposed DTA4PASS. To bridge the distortion gap between pinhole and panoramic images, USM (Sec. \ref{sec:USM}) converts all pinhole images into distorted images, referring to the source panoramic images $I_{pan}$ and the target panoramic images $I_{t}$. To bridge the texture gap between source and target images, DGA (Sec. \ref{sec:DGA}) performs feature alignment between the class-mixed augmented source distorted/panoramic images and the target panoramic images.
  • Figure 3: Overview of Unpaired Semantic Morphing (USM), where brighter colors in the pixel-level discrimination map indicate more pan-like regions. The pinhole and panoramic images $x_i$ and $x_a$ are fed into the deformation network $F$ to obtain the deformation fields and the deformed images $x_i \circ \phi_{i2a}$ and $x_a \circ \phi_{a2i}$. The proposed Dual-view Discriminator (DvD) then performs image-level discrimination $\mathcal{L}_{adv}^{img}$ and pixel-level discrimination $\mathcal{L}_{adv}^{pix}$ on the deformed images. This helps the deformation network $F$ generate a deformation field that transforms the pinhole image $x_i$ into a distorted image similar to the panoramic image $x_a$, thereby mitigating the distortion gap between the pinhole and panoramic domains.
  • Figure 4: Illustration of Distortion Gating Alignment (DGA). The source distorted pinhole images obtained from $\{U^i\}^{M}_{i=1}$ and the source panoramic images from $\{V^i\}^{N}_{i=1}$ are fed into the pinhole (yellow) and panoramic (orange) branches respectively to train two auxiliary segmentation heads. Mixed and target images are fed into the target branch (grey). The gating module $g$ allocates pin-like features $f_{pin}$ and pan-like features $f_{pan}$ for input images at the pixel level. Finally, the uncertainty estimation module reduces the difference between these two features to alleviate the texture gap between source and target domains.
  • Figure 5: Results of omnidirectional semantic segmentation on outdoor and indoor scenes. With the exception of the multi-source DA methods DTA4PASS and MS2PL, all other methods use a Combined DA setting.
  • ...and 7 more figures
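
The notation $x_i \circ \phi_{i2a}$ in the Figure 3 caption denotes an image composed with a deformation field: each output pixel reads the input image at the location the field points to. The sketch below illustrates only that warping step, in plain Python on a tiny grayscale image. It is not the paper's implementation: in USM the field comes from the learned deformation network $F$, whereas `toy_bend` here is a hand-written, hypothetical field that is not claimed to be diffeomorphic, and the names `bilinear_sample`, `warp`, and `identity` are likewise assumptions made for illustration.

```python
import math


def bilinear_sample(img, y, x):
    """Sample a 2D grayscale image (list of rows) at real-valued (y, x).

    Coordinates outside the image are clamped to the border.
    """
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
    bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp(img, phi):
    """Compute img composed with phi: output (r, c) reads img at phi(r, c) = (y, x)."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, *phi(r, c)) for c in range(w)] for r in range(h)]


def identity(r, c):
    """Identity field: every pixel samples its own location."""
    return (float(r), float(c))


def toy_bend(w, strength=1.0):
    """Hypothetical hand-written field (NOT learned, NOT the paper's F).

    Samples further down the image as the column moves away from the
    horizontal center, which bends straight horizontal lines.
    """
    cx = (w - 1) / 2.0

    def phi(r, c):
        offset = strength * ((c - cx) / cx) ** 2 if cx > 0 else 0.0
        return (r + offset, float(c))

    return phi


# A 3x3 image whose intensity equals its row index.
image = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
unchanged = warp(image, identity)
bent = warp(image, toy_bend(w=3, strength=1.0))
```

With the identity field the image is returned unchanged, while the toy field leaves the center column alone and samples the outer columns one row lower, so a horizontal line in the input becomes curved in the output. A learned field would replace `toy_bend` with per-pixel offsets predicted by a network.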