VisIRNet: Deep Image Alignment for UAV-taken Visible and Infrared Image Pairs

Sedat Ozer, Alain P. Ndigande

TL;DR

This article proposes a deep-learning-based method for aligning multimodal (visible and infrared) image pairs taken by unmanned aerial vehicles (UAVs). The method uses a two-branch convolutional neural network (CNN) built on feature embedding blocks and achieves state-of-the-art results.

Abstract

This paper proposes a deep-learning-based solution for multimodal image alignment of UAV-taken images. Many recently proposed state-of-the-art alignment techniques rely on Lucas-Kanade (LK) based solutions for a successful alignment. However, we show that state-of-the-art results can be achieved without LK-based methods. Our approach uses a two-branch convolutional neural network (CNN) built on feature embedding blocks. We propose two variants of our approach: in the first variant (ModelA), we directly predict the new coordinates of only the four corners of the image to be aligned; in the second (ModelB), we predict the homography matrix directly. Applying alignment on the image corners forces the algorithm to match only those four corners rather than computing and matching many keypoints, which can introduce many outliers and yield a less accurate alignment. We test our proposed approach on four aerial datasets and obtain state-of-the-art results compared to recent deep LK-based architectures.
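The link between the two variants can be made concrete: four corner correspondences determine a homography up to scale, so the eight corner coordinates predicted by ModelA carry the same information as the eight homography parameters predicted by ModelB. The sketch below is a minimal, dependency-free illustration of that relationship, not the authors' implementation. It assumes the common normalization $h_{33} = 1$; the function names and the "predicted" corner values are hypothetical, and the $128\times128$ source size is taken from the Figure 1 caption.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the caller's inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def homography_from_corners(src, dst):
    """Return a 3x3 H (h33 = 1, an assumption) mapping each src corner to dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), rearranged to be linear in h.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h3 x + h4 y + h5) / (h6 x + h7 y + 1), rearranged the same way.
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h, point):
    """Project a 2D point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Corners of a 128x128 source image and made-up "predicted" positions
# in the 192x192 target image (illustrative values only).
ir_corners = [(0, 0), (127, 0), (127, 127), (0, 127)]
predicted = [(30.0, 25.0), (160.0, 32.0), (155.0, 150.0), (28.0, 158.0)]
H = homography_from_corners(ir_corners, predicted)
```

Projecting each entry of `ir_corners` through `H` with `apply_homography` should reproduce the matching entry of `predicted` up to floating-point error, which is a quick sanity check on any four-corner prediction.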

Paper Structure

This paper contains 8 sections, 10 equations, 7 figures, 6 tables, 3 algorithms.

Figures (7)

  • Figure 1: An overview of the image alignment process. On the left are the input RGB image $I_{\text{RGB}}$ ($192\times192$ pixels) and the input IR image $I_{\text{IR}}$ ($128\times128$ pixels). $I_{\text{IR}}$ is shown in pseudocolors. Both images are given as input to the registration stage, where the transformation parameters represented by the homography matrix ($H$) are predicted. After the registration process, $I_{\text{IR}}$ is transformed (warped) onto the $I_{\text{RGB}}$ space by locating the positions of $c_1$, $c_2$, $c_3$, and $c_4$ as $c_{1}^{\prime}$, $c_{2}^{\prime}$, $c_{3}^{\prime}$, and $c_4^{\prime}$. The warped $I_{\text{IR}}$ is overlaid (with $\alpha = 0.4$) on $I_{\text{RGB}}$.
  • Figure 2: A summary of the architectures of several recently proposed deep alignment algorithms, including DHN [DBLP:journals/corr/DeToneMR16], MHN [Le_2020_CVPR], CLKN [Chang_2017_CVPR], and DLKFM [DBLP:journals/corr/abs-2104-11693]. DHN and MHN predict the homography parameters $H$, whereas CLKN and DLKFM rely on a Lucas-Kanade (LK) based iterative approach and use feature maps at different resolutions. In doing so, they predict the homography in steps $H_{i}$, where each step aims to correct the previous prediction.
  • Figure 3: An overview of our proposed network architecture. Two parallel branches (feature embedding blocks), the RGB branch and the IR branch, extract the salient features of the RGB and IR images, respectively. Those features are then concatenated channel-wise and fed into the regression block for direct (ModelB) or indirect (ModelA) homography prediction. That is, the model can be trained to learn the homography matrix (ModelB) or to regress the coordinates of the four corners of the input IR image on the RGB image (ModelA). In both cases the output is an 8-dimensional vector: for ModelB it holds the parameters of $H$, and for ModelA it holds the $(x,y)$ coordinates of the 4 corners of the IR image. The details of the feature embedding block are given in the top corner of the figure (also see Table \ref{table:BackboneStructure}). The details of the regression block are given in the lower right corner of the figure (also see Table \ref{table:NetHeadStructure}).
  • Figure 4: This figure shows how the initial corner points are selected on the registered image pairs and how the training data is generated. First, a random image patch is taken from the originally registered IR image. Then the random corners of that patch are transformed into fixed coordinates, and the $H$ matrix (and its inverse) performing that transformation is computed.
  • Figure 5: Qualitative results on sample image pairs taken from different datasets. The first two columns show the input image pairs for the algorithms. The target image is $192\times192$ pixels and the source image is $128\times128$ pixels (covering a scene that is a subset of the target image). The third column shows the ground-truth version ($192\times192$ pixels) of the source image after being warped onto the coordinate system of the target image. The fourth column shows the ground truth (warped) with the source image overlaid on the target image ($192\times192$ pixels). The remaining 6 columns show the overlaid results ($192\times192$ pixels) after applying registration with SIFT, DHN, MHN, CLKN, DLKFM, and our approach, in that order. Visually, each algorithm's result can be compared to the image in the fourth column.
  • ...and 2 more figures
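
The captions for Figures 1 and 3 describe an 8-dimensional output for $H$ and an overlay with $\alpha = 0.4$. The following sketch illustrates one plausible reading of those two steps and is not taken from the paper's code. It assumes that the eight predicted values fill $H$ in row-major order with $h_{33}$ fixed to 1, and that $\alpha$ weights the warped IR image in the blend; neither detail is stated in the captions, and all function names are hypothetical.

```python
def vector_to_homography(h8):
    """Reshape an 8-value prediction into a 3x3 matrix, assuming h33 = 1."""
    if len(h8) != 8:
        raise ValueError("expected exactly 8 homography parameters")
    h = [float(v) for v in h8] + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def project(h, point):
    """Map a 2D point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def blend(rgb_value, warped_ir_value, alpha=0.4):
    """Alpha-blend one channel value; alpha weights the warped IR (assumption)."""
    return alpha * warped_ir_value + (1.0 - alpha) * rgb_value


# Illustrative prediction: a pure translation by (32, 32) pixels, which would
# centre a 128x128 source image inside a 192x192 target image.
H_pred = vector_to_homography([1, 0, 32, 0, 1, 32, 0, 0])
corners = [(0, 0), (127, 0), (127, 127), (0, 127)]
warped_corners = [project(H_pred, c) for c in corners]

# Blend a single pair of intensities at a pixel covered by both images.
mixed = blend(rgb_value=100.0, warped_ir_value=200.0)
```

A full overlay would apply `blend` only at target pixels covered by the warped source image and leave the remaining target pixels unchanged.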