Pix2HDR -- A pixel-wise acquisition and deep learning-based synthesis approach for high-speed HDR videos

Caixin Wang; Jie Zhang; Matthew A. Wilson; Ralph Etienne-Cummings

TL;DR

Pix2HDR tackles the trade-off between dynamic range and temporal resolution in HDR video by coupling pixel-wise multi-phase varying exposure (MPVE) sampling on a PE-CMOS sensor with a two-stage deep learning synthesis pipeline: LDR-net for spatiotemporal upsampling and HDR-net for fusion. The MPVE pattern applies multi-phase exposures across 2×2 pixel patches to boost temporal bandwidth and dynamic range while mitigating aliasing. The LDR-HDR networks are trained end-to-end on camera-simulated measurements derived from public HDR videos. The reported result is real-time HDR video synthesis at up to 400 Hz, with 2 ms temporal resolution for HDR frames and a 12–24 dB dynamic range improvement depending on configuration. Compared with frame-based and interleaved exposures, the method shows substantial improvements in PSNR, SSIM, and HDR-VDP metrics while maintaining high spatial resolution and robustness to low-light and bright-background conditions. This approach enables high-speed HDR video for dynamic scenes in robotics, autonomous systems, and computational imaging.

Abstract

Accurately capturing dynamic scenes with wide-ranging motion and light intensity is crucial for many vision applications. However, acquiring high-speed high dynamic range (HDR) video is challenging because the camera's frame rate restricts its dynamic range. Existing methods sacrifice speed to acquire multi-exposure frames. Yet, misaligned motion in these frames can still pose complications for HDR fusion algorithms, resulting in artifacts. Instead of frame-based exposures, we sample the videos using individual pixels at varying exposures and phase offsets. Implemented on a monochrome pixel-wise programmable image sensor, our sampling pattern simultaneously captures fast motion at a high dynamic range. We then transform pixel-wise outputs into an HDR video using end-to-end learned weights from deep neural networks, achieving high spatiotemporal resolution with minimized motion blurring. We demonstrate aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under low-light conditions and against bright backgrounds, both of which are challenging conditions for conventional cameras. By combining the versatility of pixel-wise sampling patterns with the strength of deep neural networks at decoding complex scenes, our method greatly enhances the vision system's adaptability and performance in dynamic conditions.
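The idea of sampling with pixel-wise phase offsets can be illustrated with a small timing sketch. The code below is not the authors' implementation; it only counts exposure start times for a patch of pixels that share an exposure $T_E$ and are offset by multiples of $T_E/4$, as described in the Figure 2 caption. The function names `exposure_start_times` and `nyquist_hz`, and the 4 ms exposure value, are assumptions chosen for illustration. The sketch also ignores the low-pass effect of integrating over each exposure window.

```python
import numpy as np

def exposure_start_times(t_e, n_phases, n_frames):
    """Sorted start times of all exposures in a patch of n_phases pixels.

    Each pixel exposes for t_e seconds and repeats every t_e seconds;
    pixel k is delayed by k * t_e / n_phases (hypothetical helper).
    """
    offsets = np.arange(n_phases) * t_e / n_phases
    frames = np.arange(n_frames) * t_e
    return np.sort((frames[:, None] + offsets[None, :]).ravel())

def nyquist_hz(times):
    """Nyquist frequency implied by the smallest gap between samples."""
    return 0.5 / np.min(np.diff(times))

t_e = 4e-3  # assumed exposure time in seconds, for illustration only

conventional = exposure_start_times(t_e, n_phases=1, n_frames=8)
multi_phase = exposure_start_times(t_e, n_phases=4, n_frames=8)

print("conventional Nyquist (Hz):", nyquist_hz(conventional))
print("multi-phase Nyquist (Hz): ", nyquist_hz(multi_phase))
```

Under these assumptions the per-pixel sampling rate is unchanged, yet the patch as a whole observes the scene four times as often, which is the source of the bandwidth extension from $1/(2T_E)$ to $2/T_E$.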

Paper Structure (28 sections, 26 equations, 18 figures, 1 table)

Introduction
Previous work on HDR imaging
Acquiring and merging multiple frames of varying exposure
HDR imaging through pixel-wise exposure modulation
Overview
Multi-phase varying exposure (MPVE) pixel-wise sampling configuration
Eliminating temporal aliasing with pixel-wise phase offset
Maximizing SNR for high-speed signals
Minimizing blurring and extending the dynamic range
LDR-HDR networks for synthesizing HDR videos from pixel-wise outputs
Training data generation: transforming ground truth HDR video to pixel outputs
LDR-net: Mapping pixel-wise outputs to high spatio-temporal resolution videos at different exposure levels
LDR-net architecture
Training
Loss function
...and 13 more sections

Figures (18)

Figure 1: Overview of Pix2HDR acquisition and synthesis. Pix2HDR acquires the HDR scene using the multi-phase varying exposure (MPVE) pixel-wise sampling pattern, implemented on a CMOS image sensor (PE-CMOS). MPVE configures pixels into different exposures, speeds, and phase offsets to enhance temporal resolution and dynamic range. Pix2HDR then synthesizes an HDR video from the pixel-wise outputs using deep neural network weights learned through end-to-end training, achieving high spatiotemporal resolution with minimized motion blurring.
Figure 2: Multi-phase sampling enhances the temporal resolution without increasing sampling speed. A. A conventional camera samples all the pixels ($y_1,...,y_4$) concurrently (global shutter) or in fast line sequences (rolling shutter) with a pixel exposure of $T_E$ and a sampling rate of $1/T_E$. B. Frequency spectrum of the averaged pixel value, $y_{\rm avg}$. Its Nyquist bandwidth is limited to $1/(2T_E)$, and it suffers from a significant amount of temporal aliasing. C. In multi-phase sampling, pixels of exposure $T_E$ are phase-offset in multiples of $T_E/4$. D. Without increasing the sampling rate, multi-phase exposures extend the $y_{\rm avg}$ bandwidth by four times to $2/T_E$ and mitigate temporal aliasing by pushing the replica spectra to higher frequencies.
Figure 3: Exposure duration determines pixel SNR when sampling high-speed events. Left. Transient events of amplitude A with baseline intensity B and falling time constants ($\tau$) of 0.6, 1.2, and 2.4 ms. Right. Pixel SNR with respect to the exposure time, $T_E$, for the signals on the left. Each curve is normalized with the maximum SNR set to 100%. The SNR initially increases with exposure time, then drops once extra integration time adds more shot noise than signal power.
Figure 4: MPVE sampling pattern uses varying exposures (short, medium, and long) at different phase offsets to maximize SNR for signals of different temporal characteristics.
Figure 5: Temporal comparison of MPVE vs other sampling patterns. A. When sampling a signal that consists of a slow variation with two closely spaced fast events, MPVE guarantees aliasing-free sampling at high time resolution. Long pixel exposure also increases the SNR of the slow variations and expands the dynamic range. B. Sampling the same signal using time-interleaved long-short exposures introduces aliasing, so the timing of the two events cannot be resolved. C. A randomly distributed multiple-on coded exposure pattern also creates aliasing: events appearing at either of the red-colored frames generate the same reading at all pixel outputs, creating ambiguity in the events’ timing.
...and 13 more figures
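The exposure trade-off described in the Figure 3 caption can be reproduced with a simple shot-noise model. The sketch below is an assumption-laden illustration, not the paper's derivation: it takes the transient as $A e^{-t/\tau}$ on top of a constant baseline $B$, treats the integrated transient as the signal, and models noise as Poisson shot noise on the total collected light. The function name `transient_snr`, the unit values of A and B, and the exposure grid are all hypothetical.

```python
import numpy as np

def transient_snr(t_exp, tau, amplitude=1.0, baseline=1.0):
    """SNR of an exponentially decaying transient integrated for t_exp.

    Assumed model (not taken from the paper):
      signal = integral of amplitude * exp(-t / tau) over [0, t_exp]
      noise  = sqrt(signal + baseline * t_exp)   # Poisson shot noise
    """
    signal = amplitude * tau * (1.0 - np.exp(-t_exp / tau))
    noise = np.sqrt(signal + baseline * t_exp)
    return signal / noise

# Exposure grid in milliseconds (illustrative range).
exposures_ms = np.linspace(0.05, 20.0, 4000)

# Exposure that maximizes SNR for each time constant from the caption.
best_exposure_ms = {
    tau: exposures_ms[np.argmax(transient_snr(exposures_ms, tau))]
    for tau in (0.6, 1.2, 2.4)
}

for tau, t_best in best_exposure_ms.items():
    print(f"tau = {tau} ms -> SNR-optimal exposure ~ {t_best:.2f} ms")
```

In this model the SNR-optimal exposure scales in proportion to $\tau$, so no single exposure suits both fast and slow events. That is consistent with the motivation for mixing short, medium, and long exposures in the MPVE pattern.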
