SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception

Yaniv Benny, Lior Wolf

TL;DR

This paper introduces a transformer-based architecture that operates directly in the spherical domain by incorporating a novel "Spherical Local Self-Attention" and other spherically-oriented modules, and that outperforms the state of the art on 360° perception benchmarks for depth estimation and semantic segmentation.

Abstract

This paper proposes a novel method for omnidirectional 360° perception. Most common previous methods relied on equirectangular projection. This representation is easily applicable to 2D operation layers but introduces distortions into the image. Other methods attempted to remove the distortions by maintaining a sphere representation but relied on complicated convolution kernels that failed to show competitive results. In this work, we introduce a transformer-based architecture that, by incorporating a novel "Spherical Local Self-Attention" and other spherically-oriented modules, successfully operates in the spherical domain and outperforms the state-of-the-art in 360° perception benchmarks for depth estimation and semantic segmentation.

Paper Structure

This paper contains 30 sections, 12 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Sphere Representations. From left to right: uvsphere, cubesphere, icosphere, hexasphere.
  • Figure 2: Icospheres of different ranks. Each increase in rank subdivides every triangle into 4 smaller triangles, which increases the resolution.
  • Figure 3: Data as Icospheres. Icosphere of rank 3 (a1-c1) and rank 6 (a2-c2). (a) RGB, (b) Depth Map, and (c) Semantic Layout.
  • Figure 4: The SphereUFormer architecture. A spherical representation is fed into the model. A linear input projection layer encodes the RGB values into latent embedding vectors. A sequence of SAM modules applies local self-attention to the spherical data, along with downsampling layers that gradually reduce the resolution of the sphere. A second sequence of SAM modules, together with upsampling layers and bypass skip connections, decodes the data. An output projection converts the latent embeddings to the output channel size.
  • Figure 5: Spherical Local Self Attention. Attention is applied between each data node and its $K$ neighbors. A learned relative position bias encodes information about the neighbors' relative position. In the right corner is a diagram of the enclosing block.
  • ...and 8 more figures
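
The captions for Figures 2 and 5 describe two mechanisms that a small example may make more concrete: how icosphere resolution grows with rank, and how attention is restricted to each node's $K$ neighbors with an added position bias. The following is a minimal pure-Python sketch, not the authors' implementation. It assumes that rank 0 is the plain icosahedron (20 faces), uses a single attention head with no learned projections, and takes the relative position bias as a plain per-slot table rather than a learned parameter. The names `icosphere_counts` and `local_self_attention` are hypothetical.

```python
import math


def icosphere_counts(rank):
    """Return (faces, vertices, edges) of an icosphere.

    Assumption: rank 0 is the plain icosahedron; each rank step splits
    every triangle into 4, so faces and edges both grow by a factor of 4.
    """
    faces = 20 * 4 ** rank
    edges = 30 * 4 ** rank
    vertices = edges - faces + 2  # Euler's formula: V - E + F = 2
    return faces, vertices, edges


def local_self_attention(queries, keys, values, neighbors, bias):
    """Single-head attention where node i attends only to neighbors[i].

    queries, keys, values: one vector (list of floats) per node.
    neighbors[i]: indices of the K nodes that node i may attend to.
    bias[i][slot]: additive bias for the slot-th neighbor of node i,
    standing in for a learned relative position bias (assumption).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    outputs = []
    for i, query in enumerate(queries):
        # Scaled dot-product logits over the K neighbors, plus the bias.
        logits = [
            scale * sum(q * k for q, k in zip(query, keys[j])) + bias[i][slot]
            for slot, j in enumerate(neighbors[i])
        ]
        # Numerically stable softmax over the neighborhood only.
        peak = max(logits)
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the neighbors' value vectors.
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, neighbors[i]))
            for c in range(out_dim)
        ])
    return outputs
```

Under the stated rank convention, `icosphere_counts(3)` gives 1280 faces and 642 vertices, and `icosphere_counts(6)` gives 81920 faces and 40962 vertices. Because each node attends to a fixed number of neighbors, the cost of `local_self_attention` grows linearly with the number of nodes rather than quadratically as in global attention.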