Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration

Hao Ai, Lin Wang

TL;DR

Elite360M tackles the challenge of learning 360° scene understanding across depth, surface normals, and semantics by introducing an ICOSAP-based distortion-free representation and a compact Bi-projection Bi-attention Fusion to fuse ERP and ICOSAP features. The framework further enhances cross-task synergy with a Cross-task Collaboration module that disentangles task-specific information and enables cross-task spatial context sharing via attention. Extensive experiments on Matterport3D and Structured3D demonstrate clear advantages over 360° multi-task baselines in depth and normal estimation and competitive performance against single-task methods, while maintaining higher efficiency. The results highlight the practical potential for robust, edge-friendly 360° perception systems that jointly reason about geometry and semantics, enabling more capable autonomous navigation and immersive experiences.

Abstract

360° cameras capture the entire surrounding environment with a large field of view (FoV), providing comprehensive visual information from which 3D structure (e.g., depth and surface normals) and semantic information can be inferred simultaneously. Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored. Achieving this objective is challenging for two reasons: 1) the inherent spherical distortion of the planar equirectangular projection (ERP) and the insufficient global perception induced by the ultra-wide FoV of 360° images; and 2) the non-trivial problem of effectively merging geometry and semantics across tasks so that they benefit each other. In this paper, we propose a novel end-to-end multi-task learning framework, named Elite360M, that simultaneously infers 3D structure via depth and surface normal estimation and semantics via semantic segmentation. Our key idea is to build a representation with strong global perception and less distortion while exploring the inter- and cross-task relationships between geometry and semantics. We incorporate the distortion-free and spatially continuous icosahedron projection (ICOSAP) points and combine them with ERP to enhance global perception. At negligible cost, a Bi-projection Bi-attention Fusion module is designed to capture the semantic- and distance-aware dependencies between each pixel of the region-aware ERP feature and the ICOSAP point feature set. Moreover, we propose a novel Cross-task Collaboration module that explicitly extracts task-specific geometric and semantic information from the learned representation to produce preliminary predictions. It then integrates the spatial contextual information among tasks to realize cross-task fusion. Extensive experiments demonstrate the effectiveness and efficacy of Elite360M.
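
To make the distortion argument concrete, the sketch below (illustrative only, not taken from the paper) computes roughly how much sphere area the pixels of each ERP row cover, and how many faces and vertices an icosahedron has after repeated subdivision. The function names `erp_row_weights` and `icosahedron_counts` are hypothetical. The cosine weighting is the standard small-pixel approximation, and the 4-way face split per level is the usual subdivision scheme, assumed here; which points Elite360M actually samples from the icosahedron is not stated in this section.

```python
import math


def erp_row_weights(height):
    """Approximate relative sphere area of one pixel in each ERP row.

    Row centres are mapped to latitudes in (-pi/2, pi/2); the area a pixel
    covers on the sphere shrinks with the cosine of that latitude.
    """
    weights = []
    for row in range(height):
        latitude = math.pi / 2 - (row + 0.5) * math.pi / height
        weights.append(math.cos(latitude))
    return weights


def icosahedron_counts(level):
    """Faces and vertices of an icosahedron after `level` 4-way subdivisions.

    Assumption: each subdivision splits every triangle into four, so the
    counts are 20 * 4**level faces and 10 * 4**level + 2 vertices.
    """
    faces = 20 * 4 ** level
    vertices = 10 * 4 ** level + 2
    return faces, vertices


if __name__ == "__main__":
    print([round(w, 3) for w in erp_row_weights(4)])
    print([icosahedron_counts(level) for level in range(3)])
```

For a 4-row ERP image, the polar rows receive a weight of about 0.38 against about 0.92 for the rows next to the equator. This is the oversampling toward the poles that the abstract refers to as spherical distortion, whereas the subdivided icosahedron spreads its faces far more evenly over the sphere.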

Paper Structure (37 sections, 8 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: (a) Overview of Elite360M. It employs the B2F module to learn 360-specific representations from a local-with-global perspective and introduces the CoCo module to model cross-task information interaction. With three simple decoder heads, Elite360M then accomplishes three scene understanding tasks simultaneously. (b) Performance on the three tasks on the Matterport3D test set \cite{Chang2017Matterport3DLF}: Root Mean Square Error (RMSE) for depth estimation, mean angular error (Mean) for surface normal estimation, and pixel accuracy for semantic segmentation. Lower error or higher accuracy is better. UniFuse-M and BiFuse++-M are the multi-task learning frameworks built on UniFuse \cite{Jiang2021UniFuseUF} and BiFuse++ \cite{Wang2022BiFuseSA}, respectively; we follow the original decoder structures to build three decoder branches for the three tasks. InPvT++ is the SOTA multi-task learning method for conventional planar images. UniFuse-M, BiFuse++-M, and our Elite360M use ResNet-34 as the ERP encoder backbone, while InPvT++ employs the much larger ViT-Large \cite{dosovitskiy2021an} as its backbone.
  • Figure 2: Different planar projections of a spherical panorama: a) equirectangular projection (ERP); b) cubemap projection (CP); c) tangent projection (TP) (captured following \cite{Li2022OmniFusion3M}) with an FoV of $(80^\circ,80^\circ)$.
  • Figure 3: Icosahedron projection (ICOSAP) with different subdivision levels.
  • Figure 4: Qualitative comparison between Elite360M and other multi-task learning baselines on one test sample of the Matterport3D dataset.
  • Figure 5: An overview of our Elite360M framework, comprising image-based ERP feature extraction (Sec. \ref{sec:erp_feature}), point-based ICOSAP feature extraction (Sec. \ref{sec:ico_feature}), Bi-projection Bi-attention fusion (B2F) (Sec. \ref{sec:b2f}), Cross-task Collaboration (CoCo) (Sec. \ref{sec:coco}), and three task-specific decoders (Sec. \ref{sec:loss}). Notably, we employ skip connections \cite{ronneberger2015u} at the decoding stage.
  • ...and 8 more figures
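
The three metrics named in the Figure 1 caption (RMSE for depth, mean angular error for surface normals, pixel accuracy for segmentation) can be sketched in a few lines. This is a minimal illustration using their textbook definitions on flat Python lists, not the paper's evaluation code. The function names are hypothetical, and details such as masking invalid pixels or ignoring unlabeled classes are assumptions left out here.

```python
import math


def rmse(pred_depth, gt_depth):
    """Root mean square error between predicted and ground-truth depths."""
    squared = [(p - g) ** 2 for p, g in zip(pred_depth, gt_depth)]
    return math.sqrt(sum(squared) / len(squared))


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and ground-truth normals.

    Each normal is a 3-tuple; vectors are normalised inside the function.
    """
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        total += math.degrees(math.acos(cosine))
    return total / len(gt_normals)


def pixel_accuracy(pred_labels, gt_labels):
    """Fraction of pixels whose predicted class equals the ground truth."""
    correct = sum(1 for p, g in zip(pred_labels, gt_labels) if p == g)
    return correct / len(gt_labels)


if __name__ == "__main__":
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(mean_angular_error_deg([(1, 0, 0), (0, 1, 0)], [(1, 0, 0), (1, 0, 0)]))
    print(pixel_accuracy([0, 1, 2, 2], [0, 1, 1, 2]))
```

Lower is better for the first two metrics and higher is better for the third, matching the caption's note that lower error or higher accuracy indicates better performance.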