MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors

Jingdong Zhang, Xiaohang Zhan, Lingzhi Zhang, Yizhou Wang, Zhengming Yu, Jionghao Wang, Wenping Wang, Xin Li

TL;DR

MTPano addresses panoramic multi-task scene understanding when pixel-wise annotations are scarce, training on dense priors from perspective foundation models in a label-free pipeline. It introduces Panorama-Dual-BridgeNet (PD-BridgeNet), a dual-stream architecture that disentangles rotation-invariant and rotation-variant features and combines a distortion-aware ERP Token Mixer with a Gradient-Truncated Bridge, so that tasks can exchange information without propagating conflicting gradients. Auxiliary dense priors (Image Gradient, Edge Distance Field, Metric Point Map) further enrich cross-task learning. On the data side, pseudo-labels are generated from randomized perspective crops and re-projected onto the sphere. Experiments on Structured3D and Stanford2D3D report state-of-the-art performance on semantic segmentation, depth, and surface normal estimation, along with robust generalization to in-the-wild panoramas. The approach reduces annotation requirements while producing consistent panoramic predictions, offering a scalable path toward unified panoramic perception.
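As a rough illustration of the data-side step, the sketch below samples one perspective crop from an equirectangular panorama and scatters a patch-level label back onto the panorama grid. It is a minimal NumPy sketch, not the authors' pipeline: the function names, the nearest-neighbour sampling, the 90° field of view, the axis convention (x right, y down, z forward), and the yaw/pitch sampling ranges are all assumptions, and the foundation-model call is replaced by an identity placeholder.

```python
import numpy as np

def perspective_rays(fov_deg, size):
    """Unit ray directions of a square pinhole camera looking along +z."""
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    u, v = np.meshgrid(np.arange(size) + 0.5, np.arange(size) + 0.5)
    rays = np.stack([u - size / 2, v - size / 2, np.full_like(u, f)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def rotation(yaw, pitch):
    """Pitch about x, then yaw about y (radians); positive pitch tilts the view up (y is down)."""
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    r_yaw = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    r_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    return r_yaw @ r_pitch

def rays_to_erp_pixels(rays, H, W):
    """Nearest ERP pixel (row, col) for each unit ray."""
    lon = np.arctan2(rays[..., 0], rays[..., 2])        # in [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))   # in [-pi/2, pi/2]
    col = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    row = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return row, col

def crop(pano, yaw, pitch, fov_deg=90, size=64):
    """Sample a perspective patch and remember which ERP pixels it came from."""
    rays = perspective_rays(fov_deg, size) @ rotation(yaw, pitch).T
    row, col = rays_to_erp_pixels(rays, *pano.shape[:2])
    return pano[row, col], (row, col)

def reproject(label, index, H, W):
    """Scatter a patch-level label back to the ERP grid; the mask marks supervised pixels."""
    out = np.full((H, W) + label.shape[2:], np.nan)
    mask = np.zeros((H, W), dtype=bool)
    row, col = index
    out[row, col] = label
    mask[row, col] = True
    return out, mask

# Toy usage: one randomized crop of a synthetic panorama.
rng = np.random.default_rng(0)
H, W = 64, 128
pano = np.arange(H * W, dtype=float).reshape(H, W)
yaw = rng.uniform(-np.pi, np.pi)            # assumed sampling range
pitch = rng.uniform(-np.pi / 4, np.pi / 4)  # assumed sampling range
patch, index = crop(pano, yaw, pitch)
pseudo_label = patch  # placeholder for a perspective foundation model's prediction
erp_label, mask = reproject(pseudo_label, index, H, W)
```

Only the pixels flagged in `mask` receive supervision from this crop, which is one way to read the patch-wise supervision mentioned in the abstract; a loss would then be averaged over masked pixels across many crops.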

Abstract

Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model trained with a label-free pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors: we project panoramic images into perspective patches, generate accurate, domain-gap-free pseudo-labels with off-the-shelf foundation models, and re-project these pseudo-labels to serve as patch-wise supervision. Second, to reduce interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion of equirectangular projection (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet with gradient truncation, which enables beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to enrich the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic foundation models.
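The distinction between the two task groups can be made concrete on an ERP grid, where a yaw rotation of the camera is a circular shift along the width axis. In the sketch below (an illustration only; the helper names, the axis convention with y as the vertical axis, and the toy spherical scene are assumptions rather than anything from the paper), a depth-like map is transformed by the shift alone, while a normal map must additionally have its vectors rotated into the new camera frame.

```python
import numpy as np

def erp_rays(H, W):
    """Unit viewing direction at every ERP pixel centre (y is the vertical axis)."""
    lon = (np.arange(W) + 0.5) / W * 2 * np.pi - np.pi
    lat = (np.arange(H) + 0.5) / H * np.pi - np.pi / 2
    lon, lat = np.meshgrid(lon, lat)  # both of shape (H, W)
    return np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)], axis=-1
    )

def yaw_matrix(theta):
    """Rotation by theta radians about the vertical (y) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rotate_invariant(label, shift):
    """Rotation-invariant map (e.g. depth): a yaw rotation only moves pixels."""
    return np.roll(label, shift, axis=1)

def rotate_variant(normals, shift):
    """Rotation-variant map (e.g. normals): pixels move and vectors change frame."""
    theta = 2 * np.pi * shift / normals.shape[1]
    moved = np.roll(normals, shift, axis=1)
    return moved @ yaw_matrix(theta).T

# Toy scene: a camera at the centre of a sphere, so every normal points back at
# the camera and the normal map must look identical after any yaw rotation.
H, W, shift = 16, 32, 5
normals = -erp_rays(H, W)
depth = np.random.default_rng(0).uniform(1.0, 5.0, (H, W))

depth_rot = rotate_invariant(depth, shift)
normals_rot = rotate_variant(normals, shift)
normals_shift_only = np.roll(normals, shift, axis=1)  # treats normals as if invariant
```

In this toy scene `normals_rot` matches the original normal map, whereas `normals_shift_only` does not, which is the kind of inconsistency that motivates handling the two groups in separate streams.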

Paper Structure (26 sections, 7 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Overview of the MTPano framework. We employ a label-free pipeline (top) that integrates dense priors from perspective foundation models via patch-wise supervision. We propose PD-BridgeNet (bottom), a dual-stream architecture that disentangles rotation-invariant and rotation-variant features via geometry-aware modulation ($M_{inv}$ and $M_{var}$). The streams are harmonized by a Truncated Gradient Flow mechanism, which facilitates synergistic information exchange while preventing optimization interference across branches. Three auxiliary dense tasks provide additional supervision to aid the task interaction process: Image Gradient, Edge Distance Field (EDF), and Metric Point Map.
  • Figure 2: (a) We classify dense prediction tasks into rotation-invariant (e.g., Semseg, Depth) and rotation-variant (e.g., Normal) groups based on their dependency on absolute observer orientation. The same region of a rotation-invariant feature map remains consistent under a yaw rotation, whereas a rotation-variant feature map does not. (b) The ERP Token Mixer mitigates spherical distortion by dynamically fusing standard ($3\times3$) and wide ($3\times9$) kernels based on pixel latitude. (c) The proposed Panorama-Dual-BridgeNet. We disentangle feature learning into Invariant and Variant Streams via Geometry Modulation layers ($M_{inv}$ and $M_{var}$). The two streams are harmonized by a Gradient-Truncated BridgeNet, which aggregates initial predictions (Semantic Segmentation$^{\textcircled{1}}$, Depth$^{\textcircled{2}}$, Surface Normals$^{\textcircled{5}}$) with dense auxiliary cues (Image Gradient$^{\textcircled{3}}$, Edge Distance Field$^{\textcircled{4}}$, Point Map$^{\textcircled{6}}$) via Cross-Attention, enabling thorough interaction while blocking the backward propagation of conflicting gradients.
  • Figure 3: Qualitative comparisons on Structured3D. (a) MTPano outperforms single-task specialists (OPS [zheng2024open], DAP [lin2025depth], PanoNormal [huang2024panonormal]) and the multi-task baseline TaskPrompter [taskprompter2023], achieving superior segmentation accuracy and geometric detail via PD-BridgeNet's interaction. (b) Point cloud reconstruction comparison showing that MTPano yields better structural consistency than TaskPrompter.
  • Figure 4: Qualitative comparisons with single-task specialist models and multi-task models on Stanford2D3D.
  • Figure 5: Analysis of cross-task learning and feature attributes. (a) Multi-task interaction effectively eliminates projection artifacts in pseudo-labels. For instance, consistent semantic masks guide surface normal refinement (top), yielding predictions superior to the noisy supervision. (b) Feature visualization under rotation. While backbone features ($\mathcal{F}_{backbone}$) exhibit entangled attributes, our approach successfully disentangles them into rotation-stable invariant features ($\mathcal{F}_{inv}$) and orientation-sensitive variant features ($\mathcal{F}_{var}$).
  • ...and 4 more figures
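
To make the latitude-dependent fusion described for the ERP Token Mixer in Figure 2(b) more tangible, here is a small NumPy sketch. It is not the paper's layer: fixed box filters stand in for learned $3\times3$ and $3\times9$ kernels, and the blending weight $1-\cos(\text{latitude})$ is a hypothetical choice motivated only by the fact that ERP stretches content horizontally by $1/\cos(\text{latitude})$; the actual gating used in MTPano is not specified in this section.

```python
import numpy as np

def box_filter(x, kh, kw):
    """Mean filter with a kh x kw window on a 2D map.

    Width is padded circularly (an ERP image wraps around in longitude);
    height is padded by repeating the edge rows.
    """
    ph, pw = kh // 2, kw // 2
    x = np.pad(x, ((ph, ph), (0, 0)), mode="edge")
    x = np.pad(x, ((0, 0), (pw, pw)), mode="wrap")
    H, W = x.shape[0] - 2 * ph, x.shape[1] - 2 * pw
    out = np.zeros((H, W))
    for dy in range(kh):
        for dx in range(kw):
            out += x[dy:dy + H, dx:dx + W]
    return out / (kh * kw)

def latitude_weight(H):
    """Hypothetical blend weight per row: near 0 at the equator, near 1 at the poles."""
    lat = (np.arange(H) + 0.5) / H * np.pi - np.pi / 2
    return 1.0 - np.cos(lat)

def erp_mix(x):
    """Blend a standard (3x3) and a wide (3x9) response according to pixel latitude."""
    w = latitude_weight(x.shape[0])[:, None]  # shape (H, 1), broadcast over width
    return (1.0 - w) * box_filter(x, 3, 3) + w * box_filter(x, 3, 9)

# Toy usage: an impulse near the pole spreads further along the width
# than the same impulse near the equator.
H, W = 32, 64
pole = np.zeros((H, W))
pole[0, 10] = 1.0
equator = np.zeros((H, W))
equator[H // 2, 10] = 1.0
pole_out, equator_out = erp_mix(pole), erp_mix(equator)
```

With this weighting, rows near the equator behave almost like a plain $3\times3$ filter, while rows near the poles are dominated by the wide kernel, mirroring the intent of widening the receptive field where ERP distortion is strongest.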