DIAR: Deep Image Alignment and Reconstruction using Swin Transformers
Monika Kwiatkowski, Simon Matern, Olaf Hellwich
TL;DR
This work tackles the problem of aligning and reconstructing distorted image sequences whose frames are related by 2D homographies $H_{ij}$. It introduces a ray-traced synthetic dataset with ground-truth homographies and investigates neural feature maps and Video Swin Transformers for joint alignment and aggregation. Comparisons with Deep Residual Sets and several Swin-based aggregation schemes show that attention-based methods handle outliers and distortions better, with softmax-weighted aggregation delivering the strongest results. The end-to-end DIAR pipeline demonstrates the potential of combining neural descriptors with spatio-temporal attention for robust image reconstruction in challenging, artifact-rich sequences, and it highlights avenues for refinement such as bundle adjustment for finer alignment.
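To make the idea of softmax-weighted aggregation concrete, the following is a minimal sketch and not the DIAR implementation. It assumes the frames are already aligned and that some model has produced a per-pixel confidence score for every frame; the function names and the `score_maps` input are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_pixel(values, scores):
    """Softmax-weighted average of one pixel across T aligned frames.

    values: intensity of the same pixel in each frame
    scores: hypothetical per-frame confidence scores (higher = more trusted)
    """
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


def aggregate_frames(frames, score_maps):
    """Aggregate T aligned frames into one image.

    frames, score_maps: lists of T images, each a list of rows (H x W).
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [
            aggregate_pixel(
                [f[y][x] for f in frames],
                [s[y][x] for s in score_maps],
            )
            for x in range(width)
        ]
        for y in range(height)
    ]
```

With equal scores this reduces to a plain mean over frames. A frame whose score is much lower at some pixel, for example because of an occlusion or specular highlight, contributes almost nothing to the aggregated value there.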
Abstract
When taking images of partially occluded content, one often finds that every individual frame contains unwanted artifacts, while the collection of frames as a whole contains all relevant information if it is properly aligned and aggregated. In this paper, we attempt to build a deep learning pipeline that simultaneously aligns a sequence of distorted images and reconstructs the underlying content. We create a dataset of images with distortions such as lighting changes, specularities, shadows, and occlusion, together with perspective distortions whose ground-truth homographies serve as labels. We use this dataset to train Swin transformer models on sequential image data. The attention maps enable the model to detect relevant image content and to differentiate it from outliers and artifacts. We further explore neural feature maps as alternatives to classical keypoint detectors. The feature maps of trained convolutional layers provide dense image descriptors that can be used to find point correspondences between images. We use these correspondences to compute coarse image alignments and explore the limitations of this approach.
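One common way to turn dense descriptors into point correspondences is mutual nearest-neighbour matching. The sketch below illustrates that general technique under stated assumptions; the abstract does not specify the matching rule used in the paper. Descriptors are assumed to be plain lists of floats, one per sampled feature-map location, and the cosine similarity and all function names are illustrative choices.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def nearest(query, candidates):
    """Index of the candidate descriptor most similar to the query."""
    sims = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(candidates)), key=sims.__getitem__)


def mutual_matches(desc_a, desc_b):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] pick each other.

    Keeping only mutual nearest neighbours discards many ambiguous
    matches before a homography is fitted to the correspondences.
    """
    a_to_b = [nearest(d, desc_b) for d in desc_a]
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]
```

The resulting index pairs can be mapped back to pixel coordinates and passed to a robust homography estimator, which would yield a coarse alignment of the kind described above.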
