Pix2HDR -- A pixel-wise acquisition and deep learning-based synthesis approach for high-speed HDR videos

Caixin Wang, Jie Zhang, Matthew A. Wilson, Ralph Etienne-Cummings

TL;DR

Pix2HDR tackles the trade-off between dynamic range and temporal resolution in HDR video by coupling pixel-wise multi-phase varying exposure (MPVE) sampling on a PE-CMOS image sensor with a two-stage deep learning synthesis: LDR-net for spatiotemporal upsampling and HDR-net for fusion. The MPVE pattern uses multi-phase exposures across 2×2 pixel patches to boost temporal bandwidth and dynamic range while mitigating aliasing. The LDR-HDR networks are trained end-to-end on camera-simulated measurements derived from public HDR videos, achieving real-time HDR video synthesis at up to 400 Hz and 2 ms temporal resolution for HDR frames, with a 12–24 dB improvement in dynamic range depending on configuration. The results show substantial improvements over frame-based and interleaved exposures in PSNR, SSIM, and HDR-VDP metrics, while maintaining high spatial resolution and robustness to low-light and bright-background conditions. This approach enables high-speed HDR video for dynamic scenes in robotics, autonomous systems, and computational imaging.

Abstract

Accurately capturing dynamic scenes with wide-ranging motion and light intensity is crucial for many vision applications. However, acquiring high-speed high dynamic range (HDR) video is challenging because the camera's frame rate restricts its dynamic range. Existing methods sacrifice speed to acquire multi-exposure frames. Yet misaligned motion in these frames can still complicate HDR fusion and produce artifacts. Instead of frame-based exposures, we sample the videos using individual pixels at varying exposures and phase offsets. Implemented on a monochrome pixel-wise programmable image sensor, our sampling pattern simultaneously captures fast motion at a high dynamic range. We then transform pixel-wise outputs into an HDR video using end-to-end learned weights from deep neural networks, achieving high spatiotemporal resolution with minimized motion blurring. We demonstrate aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under low-light conditions and against bright backgrounds, both challenging conditions for conventional cameras. By combining the versatility of pixel-wise sampling patterns with the strength of deep neural networks at decoding complex scenes, our method greatly enhances the vision system's adaptability and performance in dynamic conditions.
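
To make the phase-offset idea concrete, the following is a minimal sketch (not the authors' implementation) of how four pixels with staggered exposures can resolve a tone that a conventional readout aliases. It assumes an ideal box-shaped exposure of length `T_E`, a pure cosine input, and simple interleaving of the four pixel streams; the helper names and the values `T_E = 4 ms` and `F_SIGNAL = 312.5 Hz` are hypothetical choices for illustration only.

```python
import numpy as np

def box_exposure_samples(freq_hz, t_exp, n_frames, phase_offset=0.0):
    """Average of cos(2*pi*f*t) over each exposure window (closed form).

    Window n covers [phase_offset + n*t_exp, phase_offset + (n+1)*t_exp].
    """
    starts = phase_offset + t_exp * np.arange(n_frames)
    w = 2.0 * np.pi * freq_hz
    return (np.sin(w * (starts + t_exp)) - np.sin(w * starts)) / (w * t_exp)

def dominant_frequency(samples, sample_rate_hz):
    """Frequency of the largest FFT bin of a uniformly sampled sequence."""
    spectrum = np.abs(np.fft.rfft(samples - samples.mean()))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate_hz)
    return freqs[np.argmax(spectrum)]

T_E = 4e-3          # exposure time in seconds (assumed value)
F_SIGNAL = 312.5    # test tone at 1.25 / T_E, above the 1/(2*T_E) Nyquist limit
N_FRAMES = 64

# Conventional readout: all pixels share one phase, one sample every T_E.
conventional = box_exposure_samples(F_SIGNAL, T_E, N_FRAMES)
conv_peak = dominant_frequency(conventional, 1.0 / T_E)

# Multi-phase readout: four pixels offset by multiples of T_E / 4.
# Interleaving their outputs gives one sample every T_E / 4.
phases = [box_exposure_samples(F_SIGNAL, T_E, N_FRAMES, k * T_E / 4)
          for k in range(4)]
interleaved = np.stack(phases, axis=1).reshape(-1)
multi_peak = dominant_frequency(interleaved, 4.0 / T_E)

print(f"conventional peak: {conv_peak:.1f} Hz")
print(f"multi-phase peak:  {multi_peak:.1f} Hz")
```

In this toy setup each pixel still reads out once per `T_E`; only the interleaved sequence has the finer `T_E / 4` spacing. The box exposure still attenuates high frequencies, so the sketch illustrates the aliasing behavior only, not the reconstruction quality of the learned networks.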

Paper Structure (28 sections, 26 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: Overview of Pix2HDR acquisition and synthesis. Pix2HDR acquires the HDR scene using the multi-phase varying exposure (MPVE) pixel-wise sampling pattern, implemented on a CMOS image sensor (PE-CMOS). MPVE configures pixels with different exposures, speeds, and phase offsets to enhance temporal resolution and dynamic range. Pix2HDR then synthesizes an HDR video of high spatiotemporal resolution from the pixel-wise outputs, using deep neural network weights learned through end-to-end training, with minimized motion blurring.
  • Figure 2: Multi-phase sampling enhances the temporal resolution without increasing sampling speed. A. A conventional camera samples all the pixels ($y_1,...,y_4$) concurrently (global shutter) or in fast line sequences (rolling shutter), with a pixel exposure of $T_E$ and a sampling rate of $1/T_E$. B. Frequency spectrum of the averaged pixel value, $y_{\rm avg}$. Its Nyquist bandwidth is limited to $1/(2T_E)$, and it suffers from significant temporal aliasing. C. In multi-phase sampling, pixels of exposure $T_E$ are phase-offset in multiples of $T_E/4$. D. Without increasing the sampling rate, multi-phase exposures extend the $y_{\rm avg}$ bandwidth fourfold to $2/T_E$ and mitigate temporal aliasing by pushing the replica spectra to higher frequencies.
  • Figure 3: Exposure duration determines pixel SNR when sampling high-speed events. Left: transient events of amplitude A with baseline intensity B and falling time constants ($\tau$) of 0.6, 1.2, and 2.4 ms. Right: pixel SNR as a function of the exposure time, $T_E$, for the signals on the left. Each curve is normalized so that its maximum SNR is 100%. The SNR first increases with exposure time, then drops once extra integration time adds more shot noise than signal power.
  • Figure 4: The MPVE sampling pattern uses varying exposures (short, medium, and long) at different phase offsets to maximize SNR for signals of different temporal characteristics.
  • Figure 5: Temporal comparison of MPVE vs. other sampling patterns. A. When sampling a signal that consists of a slow variation with two closely spaced fast events, MPVE guarantees aliasing-free sampling at high temporal resolution. The long pixel exposure also increases the SNR of the slow variations and expands the dynamic range. B. Sampling the same signal with time-interleaved long-short exposures introduces aliasing, so the timing of the two events cannot be resolved. C. A randomly distributed multiple-on coded exposure pattern also creates aliasing: events appearing in either of the red-colored frames generate the same reading at all pixel outputs, creating ambiguity in the events’ timing.
  • ...and 13 more figures
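
The exposure and SNR trade-off described in the Figure 3 caption can be sketched numerically. The model below is an assumption, not taken from the paper: the transient is `A * exp(-t / tau)` on top of a constant baseline `B`, the pixel integrates from the event onset for `T_E`, and noise is pure shot noise (variance equal to the total collected charge). The function name and the values of `amplitude` and `baseline` are hypothetical.

```python
import numpy as np

def transient_snr(t_exp_ms, tau_ms, amplitude=100.0, baseline=100.0):
    """Shot-noise-limited SNR of a decaying transient (assumed model).

    amplitude and baseline are in electrons per millisecond (hypothetical units).
    """
    # Charge contributed by the transient A*exp(-t/tau) over [0, T_E].
    signal = amplitude * tau_ms * (1.0 - np.exp(-t_exp_ms / tau_ms))
    # Charge contributed by the constant baseline B over the same window.
    background = baseline * t_exp_ms
    # Shot noise: standard deviation is the square root of the total charge.
    noise = np.sqrt(signal + background)
    return signal / noise

exposures_ms = np.linspace(0.05, 20.0, 400)
best_exposure_ms = {}
normalized_curves = {}
for tau_ms in (0.6, 1.2, 2.4):  # time constants listed in the Figure 3 caption
    snr = transient_snr(exposures_ms, tau_ms)
    normalized_curves[tau_ms] = 100.0 * snr / snr.max()  # peak set to 100%
    best_exposure_ms[tau_ms] = float(exposures_ms[np.argmax(snr)])
    print(f"tau = {tau_ms} ms: SNR peaks near T_E = {best_exposure_ms[tau_ms]:.2f} ms")
```

Under these assumptions the best exposure grows with the time constant of the event, which is consistent with the motivation given for mixing short, medium, and long exposures in the MPVE pattern. The actual curves in the paper may rest on a different noise model.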