
HSTR-Net: Reference Based Video Super-resolution with Dual Cameras

H. Umut Suluhan, Abdullah Enes Doruk, Hasan F. Ates, Bahadir K. Gunturk

TL;DR

Simulations show that the proposed model provides significant improvement over existing reference-based SR techniques in terms of PSNR and SSIM metrics, and exhibits sufficient frames per second (FPS) for aerial monitoring when deployed on a power-constrained drone equipped with dual cameras.

Abstract

High-spatio-temporal resolution (HSTR) video recording plays a crucial role in enhancing various imagery tasks that require fine-detailed information. State-of-the-art cameras provide this required high frame-rate and high spatial resolution together, albeit at a high cost. To alleviate this issue, this paper proposes a dual camera system for the generation of HSTR video using reference-based super-resolution (RefSR). One camera captures high spatial resolution low frame rate (HSLF) video while the other captures low spatial resolution high frame rate (LSHF) video simultaneously for the same scene. A novel deep learning architecture is proposed to fuse HSLF and LSHF video feeds and synthesize HSTR video frames. The proposed model combines optical flow estimation and (channel-wise and spatial) attention mechanisms to capture the fine motion and complex dependencies between frames of the two video feeds. Simulations show that the proposed model provides significant improvement over existing reference-based SR techniques in terms of PSNR and SSIM metrics. The method also exhibits sufficient frames per second (FPS) for aerial monitoring when deployed on a power-constrained drone equipped with dual cameras.
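
The abstract reports improvements in PSNR. As a reminder of what that metric measures, here is a minimal sketch, assuming 8-bit grayscale frames (peak value 255) stored as nested lists. The function name `psnr` and the input layout are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized grayscale frames.

    Illustrative helper: frames are lists of rows of pixel intensities.
    """
    if len(reference) != len(estimate) or any(
        len(r) != len(e) for r, e in zip(reference, estimate)
    ):
        raise ValueError("frames must have the same shape")

    # Mean squared error over all pixels.
    n_pixels = sum(len(row) for row in reference)
    squared_error = sum(
        (a - b) ** 2
        for r, e in zip(reference, estimate)
        for a, b in zip(r, e)
    )
    mse = squared_error / n_pixels

    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher values indicate that the reconstructed frame is closer to the ground truth; identical frames give an infinite PSNR under this definition.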

Paper Structure (15 sections, 4 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Overall architecture of HSTR-Net. Motion estimation and compensation are applied to $LR$ and $REF$ frames to generate a warped reference frame. An attention-based patch-matching scheme is applied to input frames to make use of correspondences between similar textures. A contextual representation module extracts multi-scale features by applying deformable convolution at various resolutions. Lastly, a fusion and reconstruction module is used to generate the final $HR$ output frame.
  • Figure 2: Motion Estimation Module
  • Figure 3: Contextual Representation Module
  • Figure 4: High-level diagram of the patch matching module. Given $LR$ and $REF$ frames, patch partition is applied by a convolution layer. Patch-matching groups are then applied to produce patch-matching outputs. Each output is used as an intermediate input for the next group and preserved for the final output. The last two groups apply patch merging to decrease resolution by a factor of 2.
  • Figure 5: Auto-encoder architecture of the fusion module. The contextual features $C_i$ and patch correspondences $PM_{i-1}$ are concatenated with the encoder layer outputs $S_i$ of matching resolution and provided as input to the next encoder layer.
  • ...and 3 more figures
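
The Figure 1 caption describes motion compensation that produces a warped reference frame. The warping step can be sketched as below, assuming a backward-warping convention with bilinear sampling and border clamping. The function name `warp_frame`, the flow layout, and the sign convention are assumptions for illustration; the paper estimates the flow with a learned module, whereas here the flow field is simply given as input.

```python
def warp_frame(ref, flow):
    """Backward-warp a grayscale frame with a dense flow field.

    Assumed convention (not from the paper): flow[y][x] = (dx, dy), and the
    output pixel (x, y) is sampled from `ref` at (x + dx, y + dy) using
    bilinear interpolation. Sample positions are clamped to the frame border.
    """
    height, width = len(ref), len(ref[0])
    out = [[0.0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the valid image area.
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)

            # Integer corners surrounding the sampling position.
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            ax, ay = sx - x0, sy - y0

            # Bilinear interpolation: blend along x, then along y.
            top = (1 - ax) * ref[y0][x0] + ax * ref[y0][x1]
            bottom = (1 - ax) * ref[y1][x0] + ax * ref[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom

    return out
```

With a zero flow field the output equals the input frame; a uniform flow of `(1, 0)` shifts content one pixel to the left, with the right border repeated because of clamping.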