Diffusion-Promoted HDR Video Reconstruction

Yuanshen Guan, Ruikang Xu, Mingde Yao, Ruisheng Gao, Lizhi Wang, Zhiwei Xiong

TL;DR

We address HDR video reconstruction from alternating-exposure LDR frames by learning the HDR distribution with a diffusion model. The method introduces three components: HDR-LDM, which learns the distribution of single HDR frames by tonemapping them into a latent space and injecting an exposure embedding into the diffusion process; TCAM, which captures temporal information; and ZiCA, which fuses the learned distribution prior with that temporal information. Training proceeds in stages: HDR-LDM is optimized first, TCAM is trained next, and the reconstruction is finally refined with ZiCA to generate temporally consistent HDR frames. Experiments on the DeepHDRVideo and Cinematic Video datasets show state-of-the-art performance on both objective and perceptual metrics, while the latent diffusion design reduces the computational burden for video tasks.
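To make the tonemapping step concrete, here is a minimal sketch of one way HDR values can be compressed before encoding. It assumes the μ-law curve that is common in HDR reconstruction work; the summary above does not state which tonemapping operator HDR-LDM uses, so the function names and the constant `MU` are illustrative assumptions rather than the authors' implementation.

```python
import math

# Assumed compression constant: a value often seen in HDR literature,
# not taken from the paper summarized here.
MU = 5000.0


def mu_law_tonemap(h: float, mu: float = MU) -> float:
    """Compress a linear HDR value in [0, 1] into [0, 1] with a mu-law curve."""
    return math.log1p(mu * h) / math.log1p(mu)


def mu_law_inverse(t: float, mu: float = MU) -> float:
    """Map a tonemapped value in [0, 1] back to the linear HDR domain."""
    return math.expm1(t * math.log1p(mu)) / mu


if __name__ == "__main__":
    # Dark linear values are stretched, so they occupy more of the output range.
    for h in (0.0, 0.001, 0.01, 0.1, 1.0):
        t = mu_law_tonemap(h)
        print(f"linear={h:<6} tonemapped={t:.4f} recovered={mu_law_inverse(t):.6f}")
```

A curve of this kind maps the wide range of linear HDR intensities into a bounded, more evenly distributed range, which is the kind of input an image autoencoder typically expects.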

Abstract

High dynamic range (HDR) video reconstruction aims to generate HDR videos from low dynamic range (LDR) frames captured with alternating exposures. Most existing works rely solely on the regression-based paradigm, leading to adverse effects such as ghosting artifacts and missing details in saturated regions. In this paper, we propose a diffusion-promoted method for HDR video reconstruction, termed HDR-V-Diff, which incorporates a diffusion model to capture the HDR distribution. As such, HDR-V-Diff can reconstruct HDR videos with realistic details while alleviating ghosting artifacts. However, directly introducing video diffusion models would impose a massive computational burden. To alleviate this burden, we first propose an HDR Latent Diffusion Model (HDR-LDM) to learn the distribution prior of single HDR frames. Specifically, HDR-LDM incorporates a tonemapping strategy to compress HDR frames into the latent space and a novel exposure embedding to aggregate the exposure information into the diffusion process. We then propose a Temporal-Consistent Alignment Module (TCAM) to learn the temporal information as a complement to HDR-LDM; it conducts coarse-to-fine feature alignment at different scales among video frames. Finally, we design a Zero-Init Cross-Attention (ZiCA) mechanism to effectively integrate the learned distribution prior and temporal information for generating HDR frames. Extensive experiments validate that HDR-V-Diff achieves state-of-the-art results on several representative datasets.
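The idea behind a zero-initialized cross-attention layer can be illustrated with a small NumPy sketch. The abstract only names the ZiCA mechanism, so everything below is an assumption made for illustration: the class name, the single-head layout, and in particular the choice to zero-initialize the output projection so that the layer starts as an identity mapping on the temporal features and only gradually admits the diffusion prior during training.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class ZeroInitCrossAttention:
    """Hypothetical single-head cross-attention with a zero-initialized output.

    Queries come from the temporal features; keys and values come from the
    prior features. This is a sketch, not the authors' ZiCA implementation.
    """

    def __init__(self, dim: int, rng: np.random.Generator) -> None:
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.w_q = rng.normal(0.0, scale, size=(dim, dim))
        self.w_k = rng.normal(0.0, scale, size=(dim, dim))
        self.w_v = rng.normal(0.0, scale, size=(dim, dim))
        # Assumed zero initialization: the attention branch contributes nothing
        # until training moves these weights away from zero.
        self.w_out = np.zeros((dim, dim))

    def __call__(self, temporal: np.ndarray, prior: np.ndarray) -> np.ndarray:
        q = temporal @ self.w_q            # (n_tokens, dim)
        k = prior @ self.w_k               # (m_tokens, dim)
        v = prior @ self.w_v               # (m_tokens, dim)
        attn = softmax(q @ k.T / np.sqrt(self.dim), axis=-1)
        # Residual connection: the output equals the input while w_out is zero.
        return temporal + (attn @ v) @ self.w_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = ZeroInitCrossAttention(dim=8, rng=rng)
    temporal = rng.normal(size=(4, 8))   # stand-in for aligned temporal features
    prior = rng.normal(size=(6, 8))      # stand-in for diffusion prior features
    fused = layer(temporal, prior)
    print("identity at init:", np.allclose(fused, temporal))
```

Under this assumption, the prior cannot disturb a pretrained temporal branch at the start of training, which is the usual motivation for zero-initialized fusion layers.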

Paper Structure (19 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Motivation and visual results. (a) illustrates the t-SNE distribution of input LDR frames, HDR frames, results of the regression-based baseline [lanhdr], and our results. Each point represents a 256$\times$256 cropped patch. The output of the baseline falls outside the distribution of HDR frames, whereas our results are significantly closer to the ground truth. (b) highlights the differences in reconstruction quality. Results from the regression-based baseline [lanhdr] show artifacts and noise because its supervision strategy ignores distribution learning. In contrast, our method exhibits fewer artifacts and less noise, with more plausible details. (c) compares the parameter counts of the proposed method with those of video generation diffusion models, such as Make-A-Video [makevideo], VideoComposer [videocomposer], and Video-LDM [videoldm], and of video editing diffusion models, including Tune-A-Video [tuneavideo] and Pix2Video [related_Ceylan].
  • Figure 2: The pipeline of HDR-V-Diff, which comprises three key components. (a) depicts HDR-LDM, which aims to learn the distribution prior of single HDR frames. We introduce a tonemapping strategy to compress HDR frames into the latent space and a novel exposure embedding to integrate exposure information into the diffusion process. (b) depicts TCAM, which learns the temporal information by conducting coarse-to-fine feature alignment at different scales among video frames. (c) depicts the final reconstruction process, which leverages the proposed zero-init cross-attention mechanism to integrate the learned distribution prior and temporal information, achieving high-quality HDR frame reconstruction. (d) illustrates the details of the proposed zero-init cross-attention mechanism.
  • Figure 3: Visual comparisons of different methods on video frames with 3 exposures from the DeepHDRVideo dataset [iccv21hdr].
  • Figure 4: Analysis of distribution differences. We analyze the t-SNE distribution on the DeepHDRVideo dataset [iccv21hdr] and the Cinematic Video dataset [cinematic14], for both the 2-exposure and 3-exposure settings. We visualize the distribution of input LDR frames and HDR frames, the results of the regression-based baseline [lanhdr], and our results. Our results lie closer to the HDR distribution than those of the state-of-the-art regression-based method.
  • Figure 5: Visual comparisons of different methods on video frames with 3 exposures from the Cinematic Video dataset [cinematic14].
  • ...and 1 more figure