HEAL-SWIN: A Vision Transformer On The Sphere

Oscar Carlsson; Jan E. Gerken; Hampus Linander; Heiner Spieß; Fredrik Ohlsson; Christoffer Petersson; Daniel Persson

TL;DR

HEAL-SWIN addresses the distortions inherent to fisheye and spherical imagery by combining the HEALPix spherical grid with the SWIN transformer's windowed self-attention. The approach lifts patching, shifting, and attention to the sphere, enabling distortion-free, high-resolution processing with minimal overhead. Across real and synthetic automotive datasets and indoor spherical data, HEAL-SWIN improves semantic segmentation and depth estimation relative to flat or non-spherical baselines, while maintaining competitive inference efficiency. The results are relevant to robotics and autonomous systems that handle wide-field spherical inputs, and the code is publicly available.

Abstract

High-resolution wide-angle fisheye images are becoming increasingly important for robotics applications such as autonomous driving. However, using ordinary convolutional neural networks or vision transformers on this data is problematic due to the projection and distortion losses introduced when projecting to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model capable of training on high-resolution, distortion-free spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer, enabling the network to process spherical representations with minimal computational overhead. We demonstrate the superior performance of our model on both synthetic and real automotive datasets, as well as a selection of other image datasets, for semantic segmentation, depth regression and classification tasks. Our code is publicly available at https://github.com/JanEGerken/HEAL-SWIN.

Paper Structure (35 sections, 1 equation, 14 figures, 12 tables)

Introduction
Related work
HEAL-SWIN
The SWIN transformer
The HEALPix grid
HEAL-SWIN
Patches and windows
Shifting
Relative position bias
Experiments
Semantic segmentation of fisheye street scenes
Real-world images
Synthetic images
Inference time
Dataset size ablation
...and 20 more sections

Figures (14)

Figure 1: Our HEAL-SWIN model uses the nested structure of the HEALPix grid to lift the windowed self-attention of the SWIN model onto the sphere.
Figure 2: Chamfer distance (lower is better) and mIoU (higher is better) for HEAL-SWIN (HS) and SWIN (S). Details are provided in the sections on depth estimation and on semantic segmentation of synthetic images.
Figure 3: Grid shifting scheme for window size 16: The windows before the shift are framed in red and the patches are numbered in the nested scheme. After a shift by half a window size, the patches are divided into the windows framed in blue, so that e.g. patch 0 becomes patch 12 after the shift. The hatched regions are masked in the attention layer. The horizontally hatched patches correspond to the pixels marked in yellow in Figure 4 (left). They are filled with the vertically hatched patches, which correspond to the pixels lost in the center of Figure 4 (left).
Figure 4: Grid (left) and spiral (right) shifting strategies for the HEAL-SWIN transformer, projected onto the plane for visualization. The grid lines outline the eight base pixels used for this dataset; arrows indicate the directions in which pixels move. Highlighted regions are masked in the attention layers. Note that in grid shifting, pixels at boundaries of colliding base pixels are moved to the outer edge. In spiral shifting, distortions are introduced towards the pole (center). For better visibility, the amount of shifting in these images is exaggerated.
Figure 5: Example of segmentation using SWIN (left) and HEAL-SWIN (right) on the Woodscape dataset of real automotive images. Overlays correspond to predicted segmentation masks. The pedestrian (overlaid in red) is only recognized by HEAL-SWIN.
...and 9 more figures
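
The index arithmetic in the Figure 3 caption can be checked with a small sketch. It assumes that a window of 16 patches is a 4×4 block numbered in Z-order (x bits in even positions, y bits in odd positions, as in the HEALPix nested scheme within a base pixel), and that the shifted windows are offset by half a window in both directions. Handling at base-pixel boundaries and attention masking are omitted, and the helper names are hypothetical.

```python
def nested_to_xy(index, bits):
    """De-interleave a nested (Z-order) index into local (x, y) coordinates."""
    x = y = 0
    for b in range(bits):
        x |= ((index >> (2 * b)) & 1) << b        # even bits -> x
        y |= ((index >> (2 * b + 1)) & 1) << b    # odd bits  -> y
    return x, y


def xy_to_nested(x, y, bits):
    """Interleave local (x, y) coordinates into a nested (Z-order) index."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (2 * b)
        index |= ((y >> b) & 1) << (2 * b + 1)
    return index


def shifted_local_index(index, side=4):
    """Local index of a patch inside its new window after a half-window shift.

    `side` is the window edge length in patches (4 for window size 16).
    """
    bits = side.bit_length() - 1
    half = side // 2
    x, y = nested_to_xy(index, bits)
    # Moving the window origin by -half is the same as moving the patch
    # by +half relative to it; the modulo wraps into the neighbouring window.
    return xy_to_nested((x + half) % side, (y + half) % side, bits)


print(shifted_local_index(0))  # 12, matching "patch 0 becomes patch 12"
```

Under these assumptions, patch 0 sits at local position (0, 0), lands at (2, 2) in the shifted window, and (2, 2) interleaves to index 12.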