360U-Former: HDR Illumination Estimation with Panoramic Adapted Vision Transformers

Jack Hilliard, Adrian Hilton, Jean-Yves Guillemaut

TL;DR

This work proposes a novel architecture, 360U-Former, based on a U-Net style Vision-Transformer which leverages the work of PanoSWIN, an adapted shifted window attention tailored to the ERP format, and is the first purely Vision-Transformer model used in the field of illumination estimation.

Abstract

Recent illumination estimation methods have focused on enhancing the resolution and improving the quality and diversity of the generated textures. However, few have explored tailoring the neural network architecture to the Equirectangular Panorama (ERP) format utilised in image-based lighting. Consequently, high dynamic range image (HDRI) results usually exhibit a seam at the side borders and textures or objects that are warped at the poles. To address this shortcoming, we propose a novel architecture, 360U-Former, based on a U-Net style Vision-Transformer which leverages the work of PanoSWIN, an adapted shifted window attention tailored to the ERP format. To the best of our knowledge, this is the first purely Vision-Transformer model used in the field of illumination estimation. We train 360U-Former as a GAN to generate an HDRI from a limited field-of-view low dynamic range image (LDRI). We evaluate our method using current illumination estimation evaluation protocols and datasets, demonstrating that our approach outperforms existing and state-of-the-art methods without the artefacts typically associated with the use of the ERP format.
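
The border seam mentioned in the abstract arises because the left and right edges of an ERP meet on the sphere but sit at opposite sides of the image. The sketch below is an illustration only, not code from the paper. It assumes an ERP stored as a NumPy array of shape (height, width) or (height, width, channels), and it relies on columns mapping linearly to longitude, so a 180° rotation about the vertical axis is a circular shift by half the width. The function names `rotate_erp_yaw_180` and `seam_discontinuity` are hypothetical, and the discontinuity score is an ad hoc check, not one of the evaluation protocols used by the authors.

```python
import numpy as np


def rotate_erp_yaw_180(erp: np.ndarray) -> np.ndarray:
    """Rotate an ERP 180 degrees about the vertical axis.

    Columns of an ERP map linearly to longitude, so this rotation is a
    circular shift of the columns by half the image width. After the
    shift, the original left and right borders meet in the image centre,
    where any seam becomes visible.
    """
    width = erp.shape[1]
    return np.roll(erp, width // 2, axis=1)


def seam_discontinuity(erp: np.ndarray) -> float:
    """Mean absolute difference between the first and last columns.

    Hypothetical helper: a large value suggests the panorama does not
    wrap around smoothly. Neighbouring columns always differ slightly,
    so this is only a rough indicator.
    """
    left = erp[:, 0].astype(np.float64)
    right = erp[:, -1].astype(np.float64)
    return float(np.mean(np.abs(left - right)))


if __name__ == "__main__":
    height, width = 4, 8

    # A horizontal ramp does not wrap around: the borders disagree.
    ramp = np.tile(np.arange(width, dtype=np.float64), (height, 1))

    # A cosine of longitude (sampled at pixel centres) wraps around cleanly.
    longitude = 2.0 * np.pi * (np.arange(width) + 0.5) / width
    periodic = np.tile(np.cos(longitude), (height, 1))

    print("ramp seam score:    ", seam_discontinuity(ramp))
    print("periodic seam score:", seam_discontinuity(periodic))
    print("ramp row after 180 degree rotation:", rotate_erp_yaw_180(ramp)[0])
```

Viewing the shifted panorama is the same idea as the 180° rotated ERPs shown in the qualitative comparison figures: a generator that ignores the wrap-around tends to leave a visible vertical line at the centre of the rotated image.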

Paper Structure

This paper contains 19 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Examples of an object with different surface properties being rendered with an HDRI environment map (EM) of an indoor (Top) and outdoor (Bottom) scene, from either the ground truth or generated by our network 360U-Former with PanoSWIN attention blocks. We also include the EM for each scene and method for reference.
  • Figure 2: Summary of the proposed model. Top: The overall flow of the model. The generator (ICN) uses the masked LDR ERP as input to generate the HDR ERP environment map. This is trained as a GAN by the discriminator ($D_{ICN}$). Bottom: The 360U-Former architecture used by the ICN. The PanoSWIN attention blocks W-MSA, PSW-MSA and PAM are described in \ref{ssec:Network} and \ref{fig:panoswin}.
  • Figure 3: The ERP rotations that are used as input by each of the three attention layers.
  • Figure 4: Indoor qualitative comparison of our generated ERPs in LDR with other methods. For each method and input LFOV image we show the LDR ERP rotated 180$^\circ$ to show any potential border seams and the LDR ERP rotated 90$^\circ$ by 90$^\circ$ to compare the generation at the poles of the ERP. We only include a selection of the methods from the quantitative comparison; the remaining methods can be found in the supplementary material. For the ground truth, we show the input to the network and include a dotted box around that area in the panorama.
  • Figure 5: Outdoor qualitative comparison of our generated ERPs in LDR with other methods. For each method and input LFOV image we show the LDR ERP rotated 180$^\circ$ to show any potential border seams and the LDR ERP rotated 90$^\circ$ by 90$^\circ$ to compare the generation at the poles of the ERP.
  • ...and 1 more figure