DarSwin-Unet: Distortion Aware Encoder-Decoder Architecture

Akshaya Athwale, Ichrak Shili, Émile Bergeron, Ola Ahmad, Jean-François Lalonde

TL;DR

The paper tackles pixel-level tasks on wide-angle fisheye images where distortions break translational symmetry, modeling distortion with the Unified camera model parameter $\xi \in [0,1]$ and projection $r_d = \mathcal{P}(\theta)$. It extends the distortion-aware radial Swin Transformer (DarSwin) into a full encoder-decoder architecture called DarSwin-Unet and adds a novel sampling function $g(\theta)$ to reduce input sparsity, enabling robust depth estimation. The architecture introduces an azimuth patch expanding layer and a fixed $k$-NN projection to map polar features back to Cartesian pixels, facilitating high-quality pixel-level outputs. Experiments on synthetic Matterport3D-based wide-angle data show state-of-the-art zero-shot generalization to unseen distortions and robustness across distortion levels, outperforming Swin-Unet, Swin-UPerNet, and DAT-UPerNet baselines.
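As a rough illustration of the projection $r_d = \mathcal{P}(\theta)$ mentioned above, the sketch below uses the commonly written spherical form of the unified camera model, $r_d = f \sin\theta / (\cos\theta + \xi)$. This closed form, the function name `unified_projection`, and the focal length default `f=1.0` are assumptions made for illustration; the summary itself does not state the paper's exact parameterization.

```python
import math

def unified_projection(theta, xi, f=1.0):
    """Image-plane radius r_d = P(theta) under an assumed unified camera model.

    theta: angle in radians between the incoming ray and the optical axis.
    xi:    distortion parameter in [0, 1].
    f:     focal length (hypothetical default, for illustration only).
    """
    return f * math.sin(theta) / (math.cos(theta) + xi)

theta = math.radians(40.0)

# xi = 0 reduces to the pinhole model: r_d = f * tan(theta).
r_pinhole = unified_projection(theta, xi=0.0)

# xi = 1 reduces to a stereographic projection: r_d = f * tan(theta / 2).
r_stereo = unified_projection(theta, xi=1.0)
```

Under this assumed form, the two extremes of $\xi$ recover a pinhole projection and a stereographic projection, which is one way to read "least" and "most" distorted.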

Abstract

Wide-angle fisheye images are becoming increasingly common for perception tasks in applications such as robotics, security, and mobility (e.g., drones, avionics). However, current models often either ignore the distortions in wide-angle images or are not suitable for pixel-level tasks. In this paper, we present an encoder-decoder model based on a radial transformer architecture that adapts to distortions in wide-angle lenses by leveraging the physical characteristics defined by the radial distortion profile. In contrast to the original model, which only performs classification tasks, we introduce a U-Net architecture, DarSwin-Unet, designed for pixel-level tasks. Furthermore, we propose a novel strategy that minimizes sparsity when sampling the image to create its input tokens. Our approach enhances the model's capability to handle pixel-level tasks in wide-angle fisheye images, making it more effective for real-world applications. Compared to other baselines, DarSwin-Unet achieves the best results across different datasets, with significant gains when trained on bounded levels of distortions (very low, low, medium, and high) and tested on all, including out-of-distribution distortions. We demonstrate its performance on depth estimation and show through extensive experiments that DarSwin-Unet can perform zero-shot adaptation to unseen distortions of different wide-angle lenses.

Paper Structure (20 sections, 5 equations, 11 figures, 1 table)

This paper contains 20 sections, 5 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Radial divisions adapt to the lens distortion; here, we show low (left) and high (right) distortion for illustration purposes. DarSwin [athwale2023darswin] separates radial patches equally along $\theta$ and determines the corresponding radius on the image plane according to the (known) lens distortion curve $r_d = \mathcal{P}(\theta)$.
  • Figure 2: For illustration, the wide-angle image is divided into 16 patches ($N_r = 2$ and $N_\varphi = 8$) along radius and azimuth. Nine samples are defined per patch: 3 samples along the radius and 3 samples along the azimuth. The image is bilinearly sampled and arranged in radial-azimuth format. The sampled map is passed through a CNN that embeds each patch, yielding a feature map of dimension $N_r \times N_\varphi \times \text{embed-dim}$.
  • Figure 3: Overview of our distortion-aware transformer encoder-decoder architecture, DarSwin-Unet. It employs hierarchical layers of DarSwin transformer blocks [athwale2023darswin] (top row) and replicates the structure in the decoder (similar to Swin-Unet [cao2021swinunet]). To make the architecture adapt to lens distortion, the patch partition, linear embedding, patch merging, and patch expanding layers all take the lens projection curve $\mathcal{P}(\theta)$ (cf. the background section) as input. The $k$-NN layer is used to project the feature map from polar ($N_r \times N_\varphi$) to Cartesian ($H \times W$) space.
  • Figure 4: Illustration of sampling (represented by colored dots) on a quadrant of an image taken with two different lenses: $\xi = 0$ (top row) and $\xi = 1$ (bottom row). The image is sampled according to the lens distortion curve $\mathcal{P}$ applied to different functions of $\theta$: (a) $\theta$, (b) $\tan \theta$, and (c) our novel $g(\theta)$. Observe how the first two options create large holes at either extreme value of $\xi$. In contrast, our proposed function offers a good compromise across a wide range of distortions.
  • Figure 5: Lens distortion curves from least ($\xi=0$) to most ($\xi=1$) distorted, using the unified camera model for illustration. We represent the same curves according to, from left to right, $\tan \theta$, our new $g(\theta)$, and $\theta$. The high slopes present in both the $\tan \theta$ and $\theta$ curves mean that samples will be spread far apart on the image plane. In contrast, our $g(\theta)$ offers a good compromise across the range of distortions.
  • ...and 6 more figures
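
The patch layout in Figures 1 and 2 and the polar-to-Cartesian $k$-NN lookup in Figure 3 can be sketched roughly as follows. This is a minimal sketch under several assumptions that are not taken from the paper: the closed form used for $\mathcal{P}(\theta)$, sampling at cell centers with uniform spacing in $\theta$ (rather than the paper's $g(\theta)$, whose definition is not given here), and the helper names `projection`, `polar_samples`, and `knn_indices`. It only computes sample positions and neighbor indices; how the paper combines the neighboring features is not shown.

```python
import math

def projection(theta, xi, f=1.0):
    # Assumed unified-camera-model form of P(theta); hypothetical, for illustration.
    return f * math.sin(theta) / (math.cos(theta) + xi)

def polar_samples(n_r, n_phi, s_r, s_phi, theta_max, xi, f=1.0):
    """Return grid[i][j]: list of (x, y) sample points for radial patch i, azimuth patch j."""
    d_theta = theta_max / n_r          # equal split along theta, as in Figure 1
    d_phi = 2.0 * math.pi / n_phi      # equal split along azimuth
    grid = []
    for i in range(n_r):
        row = []
        for j in range(n_phi):
            pts = []
            for a in range(s_r):
                # Assumed placement: cell centers, evenly spaced in theta.
                theta = (i + (a + 0.5) / s_r) * d_theta
                r = projection(theta, xi, f)
                for b in range(s_phi):
                    phi = (j + (b + 0.5) / s_phi) * d_phi
                    pts.append((r * math.cos(phi), r * math.sin(phi)))
            row.append(pts)
        grid.append(row)
    return grid

def knn_indices(pixel, centers, k):
    """Indices of the k patch centers closest to a Cartesian location."""
    def dist(n):
        return math.hypot(pixel[0] - centers[n][0], pixel[1] - centers[n][1])
    return sorted(range(len(centers)), key=dist)[:k]

# Figure 2 configuration: N_r = 2, N_phi = 8, 3 x 3 samples per patch.
grid = polar_samples(n_r=2, n_phi=8, s_r=3, s_phi=3,
                     theta_max=math.radians(80.0), xi=0.5)

# One representative location per patch (mean of its samples), in row-major order.
centers = [
    (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
    for row in grid for pts in row
]

# Because the sample positions are fixed, this lookup can be precomputed once per lens.
neighbors = knn_indices((0.1, 0.2), centers, k=4)
```

Since the sample positions depend only on the lens parameters, the neighbor indices for every output pixel could be computed once and reused, which is consistent with the summary's description of the $k$-NN projection as fixed.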