Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera

Mukai Yu, Mosam Dabhi, Liuyue Xie, Sebastian Scherer, László A. Jeni

TL;DR

The paper addresses the problem that planar CNNs capture the geometry of wide-FoV imagery poorly. It introduces the Unified Spherical Frontend (USF), a lens-agnostic framework that projects images from arbitrary calibrated cameras onto the unit sphere using ray-direction mappings and performs resampling, convolution, and pooling directly on the sphere. By employing distance-only spherical kernels and a modular, decoupled design for projection, resampling, and feature aggregation, USF achieves rotation-equivariance without spherical-harmonic transforms and supports zero-shot generalization to unseen lenses. Across Spherical MNIST classification, panoramic object detection, and cross-lens semantic segmentation, USF remains robust to random rotations, preserves competitive accuracy, and adapts across lenses, which makes it a practical candidate for geometry-aware vision in robotics and AR/VR settings.

Abstract

Modern perception increasingly relies on fisheye, panoramic, and other wide field-of-view (FoV) cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where image-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Frequency-domain spherical CNNs partially address this mismatch but require costly spherical harmonic transforms that constrain resolution and efficiency. We introduce the Unified Spherical Frontend (USF), a lens-agnostic framework that transforms images from any calibrated camera into a unit-sphere representation via ray-direction correspondences, and performs spherical resampling, convolution, and pooling directly in the spatial domain. USF is modular: projection, location sampling, interpolation, and resolution control are fully decoupled. Its distance-only spherical kernels offer configurable rotation-equivariance (mirroring translation-equivariance in planar CNNs) while avoiding harmonic transforms entirely. We compare standard planar backbones with their spherical counterparts across classification, detection, and segmentation tasks on synthetic (Spherical MNIST) and real-world datasets (PANDORA, Stanford 2D-3D-S), and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF processes high-resolution spherical imagery efficiently and maintains less than 1% performance drop under random test-time rotations, even without rotational augmentation, and even enables zero-shot generalization from one lens type to unseen wide-FoV lenses with minimal performance degradation.
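The projection step described in the abstract (turning pixels into ray directions on the unit sphere) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an ideal equidistant fisheye model (r = f * theta), and the function name `pixels_to_unit_sphere` and the parameters `f`, `cx`, `cy` are hypothetical. Any calibrated lens model that yields a per-pixel ray direction could take its place.

```python
import numpy as np

def pixels_to_unit_sphere(height, width, f, cx, cy):
    """Map each pixel of an ideal equidistant fisheye image (r = f * theta)
    to a unit ray direction. Hypothetical helper, for illustration only."""
    # Pixel coordinate grids: v indexes rows, u indexes columns.
    v, u = np.mgrid[0:height, 0:width].astype(np.float64)
    dx, dy = u - cx, v - cy
    r = np.hypot(dx, dy)          # pixel distance from the principal point
    theta = r / f                 # angle from the optical axis (assumed lens model)
    phi = np.arctan2(dy, dx)      # azimuth around the optical axis
    rays = np.stack(
        [np.sin(theta) * np.cos(phi),
         np.sin(theta) * np.sin(phi),
         np.cos(theta)],
        axis=-1,
    )
    return rays                   # shape (height, width, 3), unit norm by construction

# Toy 5x5 image with the principal point at the central pixel.
rays = pixels_to_unit_sphere(height=5, width=5, f=2.0, cx=2.0, cy=2.0)
```

Under this assumed model the central pixel maps to the optical axis, and pixels farther from the center map to rays at larger angles, so the resulting points are not uniformly distributed on the sphere. That non-uniformity is what motivates a resampling step before convolution.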

Paper Structure

This paper contains 46 sections, 23 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Unified Spherical Representation. From any camera to any architecture: a unified spherical pipeline for modern vision.
  • Figure 2: Rotation Equivariance and Invariance. A function $\mathcal{K}$ is rotation-equivariant if $\mathcal{K}_E(R \cdot \mathbf{x}) = R \cdot \mathcal{K}_E(\mathbf{x})$, and rotation-invariant if $\mathcal{K}_I(R \cdot \mathbf{x}) = \mathcal{K}_I(\mathbf{x})$, for all $R \in \mathrm{SO}(3)$.
  • Figure 3: Unified Spherical Frontend. (i) A planar image and its lens normal map can be combined to form (ii) a spherical image. Cameras with different lenses produce spatially varying densities and distributions of pixels when projected onto the sphere. Thus, it is crucial to perform (iii) resampling before (iv) feeding the result into a backbone composed of spherical convolution and pooling layers. Optionally, the results can be (v) resampled back onto the raw projected spherical pixel locations and (vi) unprojected back to the planar image for downstream integration.
  • Figure 4: Spherical Sampling Methods. Various location sampling strategies produce different levels of uniformity across the sphere. The bottom row displays point distributions with higher uniformity compared to coarser Goldberg polyhedron discretizations.
  • Figure 5: Spherical Convolution and Pooling. The output locations are set to be identical to the input locations. (b) visualizes a channel of convolution output with weight = 1 and bias = 0, effectively a summation operator.
  • ...and 4 more figures
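
The equivariance property in Figure 2 and the summation behavior noted in Figure 5 can be checked numerically with a small sketch. Everything here is an assumption for illustration rather than the paper's code: the Fibonacci lattice is used only because it is a compact way to get roughly uniform points (it is not necessarily one of the strategies in Figure 4), and `distance_only_conv` is a hypothetical single-scalar kernel that sums features over all neighbors within a geodesic radius.

```python
import numpy as np

def fibonacci_sphere(n):
    """Approximately uniform points on the unit sphere (Fibonacci lattice)."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i      # golden-angle increments
    s = np.sqrt(1.0 - z * z)
    return np.stack([s * np.cos(phi), s * np.sin(phi), z], axis=-1)

def distance_only_conv(points, features, radius, weight=1.0, bias=0.0):
    """Hypothetical kernel that depends only on geodesic distance:
    sum the features of every point within `radius`, then scale and shift."""
    cos_angle = np.clip(points @ points.T, -1.0, 1.0)
    geodesic = np.arccos(cos_angle)             # pairwise angular distances
    mask = (geodesic <= radius).astype(features.dtype)
    return weight * (mask @ features) + bias

def rotation_zx(a, b):
    """Rotation by angle a about z followed by angle b about x."""
    rz = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
    rx = np.array([[1.0, 0.0,        0.0],
                   [0.0, np.cos(b), -np.sin(b)],
                   [0.0, np.sin(b),  np.cos(b)]])
    return rx @ rz

points = fibonacci_sphere(200)
features = np.random.default_rng(0).normal(size=(200, 1))
R = rotation_zx(0.7, -1.1)

# Rotating the sample locations carries the signal with them.
out = distance_only_conv(points, features, radius=0.3)
out_rot = distance_only_conv(points @ R.T, features, radius=0.3)

# Per-point outputs match, so the output field rotates with the input.
equivariant = np.allclose(out, out_rot)
```

Because rotations preserve pairwise angular distances, the neighbor sets are unchanged and each point receives the same response before and after rotation. With `weight=1.0` and `bias=0.0` the operator reduces to a plain neighborhood sum, and applying a global sum or mean over the outputs would give a rotation-invariant quantity.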