Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network

Yong Shu, Liquan Shen, Xiangyu Hu, Mengyao Li, Zihao Zhou

TL;DR

This work tackles HDR video reconstruction from alternating-exposure sequences in real-world settings, addressing the lack of large-scale real data. It introduces Real-HDRV, a large real-world dataset with 500 LDRs-HDRs video pairs and diverse scenes and motions, and presents a two-stage alignment network consisting of a Global Alignment Module (GAM) and a Local Alignment Module (LAM) to handle global and local motion, respectively. GAM uses pre-defined offset bases to estimate global motion, while LAM employs a multi-scale pyramid and adaptive separable convolution to align features coherently; a fusion-reconstruction cascade then yields HDR video frames. Experiments show that models trained on Real-HDRV generalize better to real scenes than synthetic-trained models, and the proposed two-stage method achieves state-of-the-art HDR video reconstruction with favorable efficiency, highlighting Real-HDRV’s practical impact for real-world HDR video tasks.

Abstract

As an important and practical way to obtain high dynamic range (HDR) video, HDR video reconstruction from sequences with alternating exposures remains underexplored, mainly due to the lack of large-scale real-world datasets. Existing methods are mostly trained on synthetic datasets and therefore perform poorly in real scenes. In this work, to facilitate the development of real-world HDR video reconstruction, we present Real-HDRV, a large-scale real-world benchmark dataset for HDR video reconstruction, featuring various scenes, diverse motion patterns, and high-quality labels. Specifically, our dataset contains 500 LDRs-HDRs video pairs, comprising about 28,000 LDR frames and 4,000 HDR labels, covering daytime, nighttime, indoor, and outdoor scenes. To the best of our knowledge, our dataset is the largest real-world HDR video reconstruction dataset. Correspondingly, we propose an end-to-end network for HDR video reconstruction, in which a novel two-stage strategy performs alignment sequentially. Specifically, the first stage performs global alignment with adaptively estimated global offsets, reducing the difficulty of subsequent alignment. The second stage implicitly performs local alignment in a coarse-to-fine manner at the feature level using adaptive separable convolution. Extensive experiments demonstrate that (1) models trained on our dataset achieve better performance on real scenes than those trained on synthetic datasets, and (2) our method outperforms previous state-of-the-art methods. Our dataset is available at https://github.com/yungsyu99/Real-HDRV.
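
The adaptive separable convolution named in the abstract can be illustrated with a minimal sketch. The code below follows the general formulation of the technique, in which every output position gets its own pair of 1D kernels (one vertical, one horizontal) whose outer product acts as the local 2D kernel. It is not the authors' local alignment module: the function name `adaptive_separable_conv`, the replicate padding, the single-channel input, and the hand-written one-hot kernels are all assumptions made for illustration. In a trained network the kernels would be predicted from features rather than fixed by hand.

```python
def adaptive_separable_conv(feat, k_v, k_h):
    """Apply a per-pixel separable kernel to a single-channel feature map.

    feat: H x W nested list of floats.
    k_v, k_h: H x W x K nested lists holding, for every pixel, a vertical
              and a horizontal 1D kernel of odd length K (hypothetical inputs;
              a network would normally predict them).
    """
    h, w = len(feat), len(feat[0])
    k = len(k_v[0][0])
    r = k // 2  # kernel radius
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                yy = min(max(y + i - r, 0), h - 1)  # replicate padding (assumption)
                for j in range(k):
                    xx = min(max(x + j - r, 0), w - 1)
                    # The outer product k_v[i] * k_h[j] is the local 2D kernel weight.
                    acc += k_v[y][x][i] * k_h[y][x][j] * feat[yy][xx]
            out[y][x] = acc
    return out


def one_hot_kernels(h, w, k, idx):
    """Build H x W x K kernels that select tap `idx` at every pixel."""
    return [[[1.0 if t == idx else 0.0 for t in range(k)] for _ in range(w)]
            for _ in range(h)]


# Toy 4 x 4 feature map with values 0..15.
feat = [[float(4 * y + x) for x in range(4)] for y in range(4)]
k_v = one_hot_kernels(4, 4, 3, 1)  # centre tap: no vertical displacement
k_h = one_hot_kernels(4, 4, 3, 2)  # right tap: sample the right-hand neighbour
out = adaptive_separable_conv(feat, k_v, k_h)
print(out[0])  # [1.0, 2.0, 3.0, 3.0]
```

With one-hot kernels the operation reduces to a per-pixel shift, which is why it can serve as an implicit alignment step; softer kernels blend several neighbours instead of picking one.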

Paper Structure (28 sections, 7 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Row 1 shows a real-world sample from the Chen21 dataset chen2021hdr. Rows 2 and 3 show the HDR frames reconstructed by models trained on the synthetic dataset chen2021hdr and on our Real-HDRV, respectively. Models trained on our dataset recover more detail in the over-exposed regions.
  • Figure 2: (a) Some typical scenes in our dataset, which fall into four categories: indoor daytime (ID), indoor nighttime (IN), outdoor daytime (OD), and outdoor nighttime (ON) scenes. (b) Our dataset contains three kinds of motion: global motion (where only the camera is moving), local motion (where only the foreground is moving), and full motion (where both foreground and camera are moving). (c) Scene and motion distributions of our dataset. (d) Diversity comparison: our dataset vs. the Chen21 dataset chen2021hdr. (e) Statistics of motion directions in our dataset. We plot a circular histogram, where the color of each bin represents the direction of motion, and the height of each bar represents the proportion of motion in that direction relative to all directions. The per-pixel flow in each frame is computed via RAFT Teed2020.
  • Figure 3: The architecture of our proposed network.
  • Figure 4: The architecture of local alignment module (LAM).
  • Figure 5: The architecture of fusion module.
  • ...and 3 more figures
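
The motion-direction statistic described for Figure 2(e) can be sketched as follows. This is a minimal illustration rather than the authors' script: it assumes the per-pixel flow vectors have already been computed (the caption says RAFT is used for that step) and are passed in as `(dx, dy)` pairs. The function name `direction_histogram`, the choice of 8 bins, the magnitude threshold, and the angle convention (counter-clockwise from the positive x-axis) are all assumptions.

```python
import math


def direction_histogram(flows, num_bins=8, min_magnitude=1e-6):
    """Return the proportion of flow vectors falling into each direction bin.

    flows: iterable of (dx, dy) displacement pairs (assumed precomputed).
    Vectors shorter than `min_magnitude` carry no direction and are skipped.
    """
    counts = [0] * num_bins
    bin_width = 2 * math.pi / num_bins
    for dx, dy in flows:
        if math.hypot(dx, dy) < min_magnitude:
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        counts[int(angle / bin_width) % num_bins] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Toy flow field: two vectors pointing roughly right, one roughly up,
# one roughly left, and one static pixel that is ignored.
flows = [(1.0, 0.1), (1.0, 0.2), (0.1, 1.0), (-1.0, 0.1), (0.0, 0.0)]
hist = direction_histogram(flows)
print(hist)  # [0.5, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0]
```

Each entry of `hist` corresponds to the height of one bar in a circular histogram; in image coordinates the y-axis usually points down, so the mapping from bin index to on-screen direction depends on the convention chosen.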