SIDAR: Synthetic Image Dataset for Alignment & Restoration

Monika Kwiatkowski, Simon Matern, Olaf Hellwich

TL;DR

The paper presents SIDAR, a synthetic data generator that uses 3D rendering to create richly distorted image sequences with ground-truth homographies and occlusion masks for image alignment and restoration. By texturing a planar surface with arbitrary images and introducing randomized lighting, occluders, and camera configurations, SIDAR produces both aligned and perspective-distorted datasets suitable for deep homography estimation, dense image matching, and restoration tasks. It provides explicit ground-truth data (pixel-wise correspondences and masks) and is configurable as both a data generator and data augmenter, enabling large-scale, diverse training and robust evaluation. Although synthetic and planar, SIDAR offers controlled, scalable data to study and benchmark end-to-end learning approaches for alignment and artifact removal, with potential future extensions to more realistic scenes and new modalities.

Abstract

Image alignment and image restoration are classical computer vision tasks. However, there is still a lack of datasets that provide enough data to train and evaluate end-to-end deep learning models. Obtaining ground-truth data for image alignment requires sophisticated structure-from-motion methods or optical flow systems that often do not provide enough data variance, i.e., typically providing a high number of image correspondences, while only introducing few changes of scenery within the underlying image sequences. Alternative approaches utilize random perspective distortions on existing image data. However, this only provides trivial distortions, lacking the complexity and variance of real-world scenarios. Instead, our proposed data augmentation helps to overcome the issue of data scarcity by using 3D rendering: images are added as textures onto a plane, then varying lighting conditions, shadows, and occlusions are added to the scene. The scene is rendered from multiple viewpoints, generating perspective distortions more consistent with real-world scenarios, with homographies closely resembling those of camera projections rather than randomized homographies. For each scene, we provide a sequence of distorted images with corresponding occlusion masks, homographies, and ground-truth labels. The resulting dataset can serve as a training and evaluation set for a multitude of tasks involving image alignment and artifact removal, such as deep homography estimation, dense image matching, 2D bundle adjustment, inpainting, shadow removal, denoising, content retrieval, and background subtraction. Our data generation pipeline is customizable and can be applied to any existing dataset, serving as a data augmentation to further improve the feature learning of any existing method.

Paper Structure

This paper contains 18 sections, 3 theorems, 14 equations, 12 figures, 3 algorithms.

Key Result

Theorem 1

Given two projection matrices $P_i = [I|0]$, $P_j = [A|a]$ and a plane defined as $\pi^TX =0$ with homogeneous coordinates $\pi = (v^T,1)^T$, the homography induced by the plane is:
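
The formula itself is cut off in this listing. Assuming the theorem matches the standard plane-induced homography result from multiple-view geometry, it would read $H = A - a v^T$, so that $x_j \sim H x_i$ for every point on the plane. This follows because a point $X = (x^T, \rho)^T$ on the plane satisfies $\rho = -v^T x$, hence $P_j X = A x - a v^T x$. The NumPy sketch below checks this numerically; the function name and all values are illustrative and not taken from the paper.

```python
import numpy as np

def plane_induced_homography(A, a, v):
    """Assumed form H = A - a v^T for P_i = [I|0], P_j = [A|a], pi = (v^T, 1)^T."""
    return np.asarray(A, dtype=float) - np.outer(a, v)

# Arbitrary (hypothetical) second camera and plane parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
a = rng.normal(size=3)
v = rng.normal(size=3)

P_i = np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera [I|0]
P_j = np.hstack([A, a[:, None]])                # second camera [A|a]
pi = np.append(v, 1.0)                          # plane in homogeneous coordinates

# Build a homogeneous 3D point on the plane: pi^T X = v.x + rho = 0.
x = rng.normal(size=3)
X = np.append(x, -v @ x)

x_i = P_i @ X  # projection into view i
x_j = P_j @ X  # projection into view j

H = plane_induced_homography(A, a, v)
print(np.allclose(H @ x_i, x_j))  # H maps view-i points on the plane to view j
```

With this construction the two sides agree exactly; in general, homogeneous image points only need to agree up to a non-zero scale factor.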

Figures (12)

  • Figure 1: Illustration of a randomly generated scene using Blender. The plane shows a painting. The white pyramids describe randomly generated cameras; the yellow cone describes a spotlight. Geometric objects serve as occlusions and cast shadows onto the plane.
  • Figure 2: An illustration of changing the principal distance between the projection center $C$ and the image plane $\mathcal{I}$. The position of the plane $\pi$ and the projection center are fixed. In a) the image plane $\mathcal{I}$ captures all of the content from the painting plane $\pi$.
  • Figure 3: Illustration of a camera sensor with width $w'$ that is aligned with the image with width $w$. The principal distance is given as $f$, and the distance between the camera and image is given as $d$.
  • Figure 4: Illustration of a homography induced by a plane.
  • Figure 5: An illustration of the centering of an area light $l$. By default, Blender aligns new light sources along the $z$-axis. The correct orientation is described by the dotted line $l'$. $\theta$ describes the rotation offset between both configurations. $\theta$ is the angle between the (inverse) location vector $-v$ of the light and the vector $(0,0,-1)$.
  • ...and 7 more figures
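
The geometry in the captions of Figures 3 and 5 can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's code: it assumes the textured plane is centered at the origin (so a light at location $v$ should point along $-v$), and it assumes a pinhole model in which the image of width $w$ at distance $d$ exactly fills a sensor of width $w'$, giving $f = w' d / w$ by similar triangles. The function names and the sample values are hypothetical.

```python
import numpy as np

def light_rotation_offset(location):
    """Angle and axis rotating the default direction (0, 0, -1) onto -v."""
    v = np.asarray(location, dtype=float)
    default_dir = np.array([0.0, 0.0, -1.0])
    target_dir = -v / np.linalg.norm(v)  # direction from the light to the origin
    cos_theta = np.clip(default_dir @ target_dir, -1.0, 1.0)
    theta = float(np.arccos(cos_theta))
    axis = np.cross(default_dir, target_dir)
    norm = np.linalg.norm(axis)
    # Parallel or anti-parallel vectors have no unique axis; fall back to x.
    axis = axis / norm if norm > 1e-12 else np.array([1.0, 0.0, 0.0])
    return theta, axis

def aligned_principal_distance(sensor_width, image_width, distance):
    """Principal distance f = w' * d / w under the stated pinhole assumption."""
    return sensor_width * distance / image_width

# Example values (illustrative only).
theta, axis = light_rotation_offset([5.0, 0.0, 0.0])
f = aligned_principal_distance(sensor_width=36.0, image_width=2.0, distance=4.0)
print(theta, axis, f)
```

The angle and axis form one possible parameterization of the rotation offset; how the offset is actually applied to a Blender object is not specified in the captions.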

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3