Wavelet-Domain Masked Image Modeling for Color-Consistent HDR Video Reconstruction

Yang Zhang, Zhangkai Ni, Wenhan Yang, Hanli Wang

TL;DR

WMNet addresses HDR video reconstruction from LDR inputs by jointly tackling color fidelity and temporal inconsistency. It introduces Wavelet-domain Masked Image Modeling (W-MIM) with curriculum masking for robust color restoration, and augments temporal coherence with Temporal Mixture of Experts (T-MoE) and a scene-specific Dynamic Memory Module (DMM). The method is validated on a restructured HDRTV4K-Scene dataset and shows state-of-the-art performance across multiple metrics and strong generalization to RealHDRV, with favorable subjective user study results. These contributions provide a scalable approach for high-quality HDR video reconstruction with improved color accuracy and temporal stability, along with a practical scene-based benchmark for future research.

Abstract

High Dynamic Range (HDR) video reconstruction aims to recover fine brightness, color, and details from Low Dynamic Range (LDR) videos. However, existing methods often suffer from color inaccuracies and temporal inconsistencies. To address these challenges, we propose WMNet, a novel HDR video reconstruction network that leverages Wavelet-domain Masked Image Modeling (W-MIM). WMNet adopts a two-phase training strategy. In Phase I, W-MIM performs self-reconstruction pre-training by selectively masking color and detail information in the wavelet domain, enabling the network to develop robust color restoration capabilities. A curriculum learning scheme further refines the reconstruction process. Phase II fine-tunes the model using the pre-trained weights to improve the final reconstruction quality. To improve temporal consistency, we introduce the Temporal Mixture of Experts (T-MoE) module and the Dynamic Memory Module (DMM). T-MoE adaptively fuses adjacent frames to reduce flickering artifacts, while DMM captures long-range dependencies, ensuring smooth motion and preservation of fine details. Additionally, since existing HDR video datasets lack scene-based segmentation, we reorganize HDRTV4K into HDRTV4K-Scene, establishing a new benchmark for HDR video reconstruction. Extensive experiments demonstrate that WMNet achieves state-of-the-art performance across multiple evaluation metrics, significantly improving color fidelity, temporal coherence, and perceptual quality. The code is available at: https://github.com/eezkni/WMNet
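To make the Phase I idea more concrete, the following is a minimal sketch of masking in the wavelet domain with a curriculum on the masking ratio. It is not the authors' implementation: the single-level Haar transform, the patch-wise random mask applied to every subband, the linear easy-to-hard schedule, and all function names (`haar_dwt2`, `random_patch_mask`, `curriculum_ratio`, `wavelet_mask`) are assumptions chosen for illustration. The released code at the URL above is the reference for the actual method.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns one low-frequency band and three detail bands, each (H/2, W/2).
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-frequency band (coarse brightness/color)
    lh = (a - b + c - d) / 2.0   # detail band
    hl = (a + b - c - d) / 2.0   # detail band
    hh = (a - b - c + d) / 2.0   # detail band
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def random_patch_mask(shape, ratio, patch, rng):
    """Keep-mask (1 = keep, 0 = masked) made of patch x patch blocks.

    Assumes both dimensions of `shape` are divisible by `patch`.
    """
    gh, gw = shape[0] // patch, shape[1] // patch
    n_blocks = gh * gw
    n_masked = int(round(ratio * n_blocks))
    flat = np.ones(n_blocks)
    flat[rng.permutation(n_blocks)[:n_masked]] = 0.0
    # Expand each grid cell into a patch x patch block.
    return np.kron(flat.reshape(gh, gw), np.ones((patch, patch)))


def curriculum_ratio(step, total_steps, start=0.1, end=0.5):
    """Hypothetical linear schedule: mask little early, more later."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)


def wavelet_mask(frame, ratio, patch=4, seed=0):
    """Mask an (H, W, C) float frame in the wavelet domain, channel by channel."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(frame)
    for ch in range(frame.shape[2]):
        bands = haar_dwt2(frame[:, :, ch])
        masked = [b * random_patch_mask(b.shape, ratio, patch, rng) for b in bands]
        out[:, :, ch] = haar_idwt2(*masked)
    return out


if __name__ == "__main__":
    frame = np.random.default_rng(1).random((16, 16, 3))
    ratio = curriculum_ratio(step=500, total_steps=1000)
    masked = wavelet_mask(frame, ratio)
    print(masked.shape)
```

In a pre-training loop under these assumptions, the network would receive `masked` as input and be trained to reconstruct `frame`, with `curriculum_ratio` raising the masking ratio as training progresses.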

Paper Structure

This paper contains 31 sections, 17 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Comparison of different masking strategies on color gamut distribution. (a) Original HDR frame and its corresponding color gamut; (b) Spatial-domain masked HDR frame and its color gamut; (c) Wavelet-domain masked HDR frame and its color gamut. Zoom in for clearer visualization of color differences.
  • Figure 2: Overall framework of WMNet. Phase I performs self-reconstruction pre-training using Wavelet-domain Masked Image Modeling (W-MIM) to enhance color and detail restoration. Phase II fine-tunes the encoder for HDR video reconstruction and incorporates the Temporal Mixture of Experts (T-MoE) and Dynamic Memory Module (DMM) to improve temporal consistency.
  • Figure 3: DMM processes input features based on the scene and consists of two key components: memory matching and memory updating. In memory matching, the input feature $F^{\prime}_{t}$ and the retrieved memory feature $M_{s}$ are used to compute cross-attention, generating the enhanced feature $\hat{F}_{t}$. Memory updating then refines $\hat{F}_{t}$ into $\hat{M}_s$ and adds it to the memory dictionary, ensuring that the dictionary remains both real-time and globally representative.
  • Figure 4: Qualitative results on the HDRTV4K-Scene dataset. Zoom in for a better view of details.
  • Figure 5: Qualitative results on the HDRTV4K-LongScene dataset. Zoom in for a better view of details.
  • ...and 1 more figure
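
The memory matching and memory updating steps described in the Figure 3 caption can be sketched roughly as below. This is an illustrative sketch only, not the paper's DMM: the class name `SceneMemory`, the absence of learned query/key/value projections, the residual connection, and the momentum-based update rule are all assumptions. Only the overall flow comes from the caption: a per-scene memory dictionary, cross-attention between the input feature and the retrieved memory, and a write-back of the enhanced feature.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class SceneMemory:
    """Hypothetical per-scene memory dictionary (assumption, not the paper's DMM)."""

    def __init__(self, momentum=0.9):
        self.memory = {}          # scene_id -> (N, C) memory feature
        self.momentum = momentum  # assumed blending factor for updates

    def match(self, scene_id, feat):
        """Memory matching: cross-attend input tokens (N, C) to the scene memory."""
        mem = self.memory.get(scene_id)
        if mem is None:
            # No memory stored for this scene yet: pass the feature through.
            return feat
        scores = feat @ mem.T / np.sqrt(feat.shape[1])
        attn = softmax(scores, axis=-1)   # each row sums to 1
        return feat + attn @ mem          # residual connection is an assumption

    def update(self, scene_id, enhanced):
        """Memory updating: blend the enhanced feature into the stored memory."""
        old = self.memory.get(scene_id)
        if old is None:
            self.memory[scene_id] = enhanced.copy()
        else:
            m = self.momentum
            self.memory[scene_id] = m * old + (1.0 - m) * enhanced


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dmm = SceneMemory()
    for t in range(3):                      # three frames from the same scene
        feat = rng.standard_normal((6, 4))  # 6 tokens, 4 channels
        enhanced = dmm.match("scene_0", feat)
        dmm.update("scene_0", enhanced)
    print(dmm.memory["scene_0"].shape)
```

Keying the dictionary by scene means features from one scene never attend to another scene's memory, which is one plausible reading of "scene-specific" in the summary above.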