DIAR: Deep Image Alignment and Reconstruction using Swin Transformers

Monika Kwiatkowski; Simon Matern; Olaf Hellwich

TL;DR

This work tackles the problem of reconstructing and aligning distorted image sequences that are related by 2D homographies $H_{ij}$. It introduces a ray-traced synthetic dataset with ground-truth homographies and investigates neural feature maps and Video Swin Transformers to perform joint alignment and aggregation for reconstruction. Through comparisons with Deep Residual Sets and various Swin-based aggregation schemes, the study shows attention-based methods better handle outliers and distortions, with softmax-weighted aggregation delivering the strongest results. An end-to-end DIAR pipeline demonstrates the potential of combining neural descriptors with spatio-temporal attention, offering a practical approach to robust image reconstruction in challenging, artifact-rich sequences, while highlighting avenues for refinement such as bundle adjustment for finer alignment.
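As a rough illustration of what softmax-weighted aggregation over an aligned sequence can look like, the following NumPy sketch fuses frames using per-pixel weights. It is not the authors' implementation: the function name `softmax_aggregate` and the `scores` input are assumptions made for this example. In a learned model the scores would presumably be predicted by the network rather than supplied by hand.

```python
import numpy as np


def softmax_aggregate(frames, scores):
    """Fuse T aligned frames into one image with per-pixel softmax weights.

    frames: array of shape (T, H, W, C), already aligned images.
    scores: array of shape (T, H, W), unnormalized per-pixel scores
            (hypothetical input; assumed here, not taken from the paper).
    """
    frames = np.asarray(frames, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    # Subtract the per-pixel maximum over the frame axis for numerical stability.
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)  # weights sum to 1 over T

    # Weighted sum over the frame axis; weights broadcast over channels.
    return (weights[..., None] * frames).sum(axis=0)
```

With equal scores this reduces to a plain per-pixel mean. A frame that receives a very low score at some pixel, for example because it is occluded there, contributes almost nothing to the output at that pixel.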

Abstract

When taking images of some occluded content, one is often faced with the problem that every individual image frame contains unwanted artifacts, but a collection of images contains all relevant information if properly aligned and aggregated. In this paper, we attempt to build a deep learning pipeline that simultaneously aligns a sequence of distorted images and reconstructs them. We create a dataset that contains images with image distortions, such as lighting, specularities, shadows, and occlusion. We create perspective distortions with corresponding ground-truth homographies as labels. We use our dataset to train Swin transformer models to analyze sequential image data. The attention maps enable the model to detect relevant image content and differentiate it from outliers and artifacts. We further explore using neural feature maps as alternatives to classical key point detectors. The feature maps of trained convolutional layers provide dense image descriptors that can be used to find point correspondences between images. We utilize this to compute coarse image alignments and explore its limitations.
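To make the step from point correspondences to a coarse alignment concrete, here is a minimal sketch that estimates a homography with the standard direct linear transform (DLT). It assumes correspondences are already available, for instance from matching dense descriptors, and it is not claimed to be the procedure used in the paper. The function names are hypothetical, and the sketch omits outlier rejection (such as RANSAC) and coordinate normalization, both of which a practical pipeline would likely need.

```python
import numpy as np


def estimate_homography(src, dst):
    """Estimate H (3x3, scaled so H[2, 2] == 1) with dst ~ H @ src via DLT.

    src, dst: arrays of shape (N, 2) with N >= 4 matching points.
    Assumes exact or low-noise matches and a solution with H[2, 2] != 0.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)

    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)

    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]


def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

Four correspondences in general position determine the homography up to scale; additional matches turn the problem into a least-squares fit, which is why mismatched points can degrade the estimate noticeably.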

Paper Structure (22 sections, 5 equations, 19 figures)

Introduction
Related Work
Deep Image Alignment
Deep Image Stitching
(Vision) Transformers
Dataset
Aligned Dataset
Misaligned Dataset
Homography
Deep Image Alignment
Architecture
Deep Residual Sets
Video Swin Transformer
Image Reconstruction using Swin Transformers
Training
...and 7 more sections

Figures (19)

Figure 1: Illustration of a randomly generated scene using Blender. The plane shows a painting. The white pyramids describe randomly generated cameras; the yellow cone describes a spotlight. Geometric objects serve as occlusions and cast shadows onto the plane.
Figure 2: Four randomly generated images that are aligned.
Figure 3: The image shows the output of our rendering pipeline when only ambient lighting is used. The image is free of any artifacts.
Figure 4: Four images containing perspective distortions. The first image is aligned with the camera's view, but it also contains image distortions.
Figure 5: A pair of images and their corresponding feature maps.
...and 14 more figures
