SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation

Qinfeng Zhu, Yunxi Jiang, Lei Fan

TL;DR

This paper introduces SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame, and Pose35, a variant of the Stanford2D3D dataset perturbed by random rotations within $\pm 35^\circ$.

Abstract

Panoramic semantic segmentation models are typically trained under a strict gravity-aligned assumption. However, real-world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; and (3) a gauge-aware relative positional mechanism that encodes local angular geometry using tangent-plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index-based spherical resampling together with a logit-level SO(3)-consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within $\pm 35^\circ$. Under the extreme test of arbitrary full SO(3) rotations, existing state-of-the-art methods fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.
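To make the rotation perturbation and index-based resampling mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an equirectangular grid, nearest-neighbor lookup, and per-axis Euler angles drawn uniformly within $\pm 35^\circ$; the abstract does not specify how Pose35 parameterizes its rotations or which spherical grid SO3UFormer uses, so all function names and these choices are hypothetical.

```python
import numpy as np


def euler_rotation(yaw, pitch, roll):
    """Return R = Rz(yaw) @ Ry(pitch) @ Rx(roll); angles in radians."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx


def sample_bounded_rotation(rng, max_deg=35.0):
    """Assumption: each Euler angle is uniform in [-max_deg, max_deg]."""
    yaw, pitch, roll = np.deg2rad(rng.uniform(-max_deg, max_deg, size=3))
    return euler_rotation(yaw, pitch, roll)


def pixel_directions(h, w):
    """Unit vectors at the pixel centers of an h x w equirectangular grid."""
    lat = np.pi / 2 - (np.arange(h) + 0.5) * np.pi / h  # top row is near +pi/2
    lon = (np.arange(w) + 0.5) * 2 * np.pi / w - np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (h, w)
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )


def rotation_index_map(h, w, R):
    """For every output pixel, the (row, col) of the source pixel it copies."""
    # Content at direction d moves to R d, so an output direction d_out reads
    # from R^T d_out; with row vectors that is `directions @ R`.
    src = pixel_directions(h, w) @ R
    lat = np.arcsin(np.clip(src[..., 2], -1.0, 1.0))
    lon = np.arctan2(src[..., 1], src[..., 0])
    rows = np.clip(np.floor((np.pi / 2 - lat) / np.pi * h), 0, h - 1).astype(int)
    cols = np.floor((lon + np.pi) / (2 * np.pi) * w).astype(int) % w
    return rows, cols


def rotate_panorama(image, R):
    """Rotate an (h, w) or (h, w, c) array by pure index lookup."""
    h, w = image.shape[:2]
    rows, cols = rotation_index_map(h, w, R)
    return image[rows, cols]
```

Because the lookup only copies values, the same index map can be applied to images, label maps, and per-pixel logits alike. A logit-level consistency term could then, under these assumptions, compare the logits predicted for a rotated input against the resampled logits of the unrotated input.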

Paper Structure (19 sections, 20 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Breaking the Gravity Lock: Rotation Robustness in Panoramic Segmentation. We present a comparison between a canonical upright view (a) and the same scene under an arbitrary SO(3) rotation (b), mimicking real-world unconstrained motion. (c) shows the ground-truth semantic map for the rotated input. (d) The state-of-the-art SphereUFormer [benny2025sphereuformer] fails catastrophically on the rotated input, as it relies on absolute latitude cues (gravity bias) and cannot recognize the tilted geometry. (e) In contrast, our SO3UFormer reduces reliance on global coordinate cues and produces a consistent segmentation that closely matches the ground truth. (f) Quantitative results confirm that while the baseline performance collapses by over 42 mIoU points under rotation, our method maintains robust accuracy (70.7 mIoU), effectively closing the SO(3) domain gap.
  • Figure 2: SO3UFormer overview. A U-shaped spherical Transformer with gauge-aware, quadrature-consistent local attention, geometry-consistent down/up sampling, and an SO(3)-consistency regularizer via spherical index-based resampling.
  • Figure 3: Qualitative comparison under the out-of-distribution SO(3) stress test on Pose35 validation. Three representative scenes are shown (a)-(c), each evaluated under arbitrary 3D reorientation. For compact presentation, the methods are arranged across the left and right halves of the figure, but all predictions correspond to the same rotated inputs. Compared with SFSS, HealSwin, Elite360, and SphereUFormer, our SO3UFormer produces substantially more stable and semantically coherent layouts under full SO(3) perturbations, with reduced large-scale label drift and structural inconsistencies. The color legend at the bottom shows the 13-class palette used for visualization (the unknown class is excluded from mIoU evaluation).
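
The Figure 3 caption notes that the unknown class is excluded from mIoU evaluation. As a rough illustration of what that exclusion typically means, here is a minimal sketch of mIoU with an ignored label. It assumes integer label maps and that the unknown class has index 0; the label indexing and the function name `mean_iou` are hypothetical, and the authors' evaluation code may differ (for example, in how classes absent from a split are handled).

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=0):
    """Mean intersection-over-union over classes, skipping `ignore_index`.

    Pixels whose ground-truth label equals `ignore_index` are dropped, and
    that class contributes no IoU term. Classes that appear in neither the
    prediction nor the target are skipped rather than counted as zero.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    ious = []
    for c in range(num_classes):
        if c == ignore_index:
            continue
        intersection = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:
            ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")
```

For scale, the baseline drop quoted in the abstract, from 67.53 to 25.26 mIoU, is a difference of 42.27 points, while SO3UFormer's change from 72.03 to 70.67 mIoU is 1.36 points.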