gQIR: Generative Quanta Image Reconstruction

Aryan Garg; Sizhuo Ma; Mohit Gupta

TL;DR

This work presents an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging, and substantially improves perceptual quality over classical and modern learning-based baselines.

Abstract

Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at https://github.com/Aryan-Garg/gQIR.
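To make the "sparse, noisy, binary photon detections" and "Bernoulli photon statistics" mentioned in the abstract concrete, the following is a minimal sketch of a commonly used single-photon model. It is not taken from the gQIR code. It assumes Poisson photon arrivals, so a pixel fires with probability 1 - exp(-flux), ignores dark counts and quantum efficiency, and assumes that a 3-bit value can be formed by summing 7 binary frames. The function names (simulate_binary_frames, sum_frames, mle_flux) and all parameter values are hypothetical.

```python
import numpy as np

def simulate_binary_frames(flux, num_frames, rng):
    """Sample binary quanta frames.

    flux: mean photons per pixel per frame (assumed Poisson arrivals).
    A pixel reports 1 if at least one photon was detected, else 0.
    """
    p_detect = 1.0 - np.exp(-flux)  # Bernoulli success probability
    samples = rng.random((num_frames,) + flux.shape)
    return (samples < p_detect).astype(np.uint8)

def sum_frames(frames):
    """Sum binary frames along time to get per-pixel photon counts."""
    return frames.sum(axis=0)

def mle_flux(counts, num_frames):
    """Maximum-likelihood flux estimate obtained by inverting 1 - exp(-flux)."""
    rate = np.clip(counts / num_frames, 0.0, 1.0 - 1e-6)  # avoid log(0)
    return -np.log1p(-rate)

rng = np.random.default_rng(0)
flux = np.full((32, 32), 0.2)  # hypothetical low-light scene

# Assumption: 7 binary frames summed give counts in 0..7, which fit in 3 bits.
frames = simulate_binary_frames(flux, 7, rng)
nano_burst = sum_frames(frames)

# With many frames, the inverted detection rate approaches the true flux.
long_counts = sum_frames(simulate_binary_frames(flux, 2000, rng))
estimate = mle_flux(long_counts, 2000)
```

With only 7 frames the per-pixel estimate is extremely noisy, which is the regime where a learned prior is expected to help; the long 2000-frame sum is included only to show that the model is consistent.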

Paper Structure

This paper contains 14 sections, 9 equations, 8 figures, 5 tables.

Introduction
Related Work
Methodology
Image Formation Model
Stage 1: Quanta Aligned VAE
Stage 2: Perceptual Enhancement
Stage 3: Latent Burst Imaging
Implementation Details
Experiments
Quantitative Evaluation
Qualitative Evaluation
Ablation Studies
Real World Testing
Limitations and Outlook

Figures (8)

Figure 1: gQIR: Photorealistic single image and burst reconstruction from ultra-high-speed color SPADs. Our pipeline reconstructs high-quality RGB images from 3-bit color-SPAD CFA nano-bursts (left) and merges SPAD photon cubes into temporally consistent bursts (right). From photon-starved inputs captured at 10k–50k fps in extreme, out-of-domain scenes, gQIR recovers sharp textures, accurate color, and coherent structure by leveraging a generative prior. For burst sequences up to 100k fps, FusionViT aligns and dynamically merges quanta latents, outperforming traditional and learning-based methods in both fidelity and perceptual quality under motion.
Figure 2: Overview of gQIR. Three-stage framework for quanta burst reconstruction: (S1) a quanta-aligned VAE for joint denoising and demosaicing of SPAD nano-bursts, (S2) an adversarially finetuned LoRA [hu2022lora] latent U-Net initialized with Stable Diffusion [stablediffusion_sd_original] weights for perceptual enhancement, and (S3) a latent burst FusionViT for motion-aware spatio-temporal fusion of bursts of nano-burst inputs.
Figure 3: Qualitative comparison: single 3-bit frame reconstructions. Conventional finetuned baselines over-smooth high-frequency structures, especially in distant depth planes and textured regions, whereas gQIR preserves sharper details and more faithful facial features, benefiting from the inclusion of FFHQ faces [StyleGAN_ffhq] in the training set.
Figure 4: Encoder collapse under predegradation removal loss [supirhypir]. The encoder $\mathcal{E}_{\phi^*}$ learns a perceptually meaningless shortcut and produces constant outputs. Because the trainable encoder controls both the supervision and the prediction terms, training quickly converges to a degenerate optimum without our proposed modifications.
Figure 5: Dynamic spatio-temporal latent burst merging. Naive averaging of flow-aligned burst latents yields blur under scene motion. FusionViT instead adaptively weights latents by motion and proximity to the reference, producing a sharper output.
...and 3 more figures
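
The contrast drawn in the Figure 5 caption, naive averaging versus adaptive weighting by motion and proximity to the reference, can be illustrated with a small hand-written sketch. This is not FusionViT, which is a learned model. It is a hypothetical stand-in that scores each already-aligned latent by its mean squared residual to the reference (a crude proxy for motion) and by its temporal distance from the reference, then merges with softmax weights. The names weighted_merge, motion_scale, and time_scale, and the array shapes, are assumptions made for illustration.

```python
import numpy as np

def naive_merge(latents):
    """Plain average over the burst axis (axis 0)."""
    return latents.mean(axis=0)

def weighted_merge(latents, ref_index, motion_scale=1.0, time_scale=2.0):
    """Merge aligned latents with softmax weights.

    latents: array of shape (T, C, H, W), assumed already flow-aligned.
    Frames that differ strongly from the reference, or that lie far from
    it in time, receive smaller weights.
    """
    ref = latents[ref_index]
    # Mean squared residual of every frame to the reference frame.
    residual = ((latents - ref) ** 2).mean(axis=tuple(range(1, latents.ndim)))
    # Temporal distance (in frames) from the reference.
    distance = np.abs(np.arange(len(latents)) - ref_index)
    scores = -residual / motion_scale - distance / time_scale
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    merged = np.tensordot(weights, latents, axes=1)  # weighted sum over T
    return merged, weights

rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8, 8))                        # hypothetical latent
latents = clean + 0.01 * rng.normal(size=(5, 4, 8, 8))    # 5 noisy copies
latents[0] += 3.0                                         # one badly misaligned frame

naive = naive_merge(latents)
merged, weights = weighted_merge(latents, ref_index=2)

naive_error = np.abs(naive - clean).mean()
merged_error = np.abs(merged - clean).mean()
```

In this toy setup the misaligned frame drags the plain average away from the clean latent, while the weighted merge assigns it a negligible weight. A learned fusion module would predict such weights from the data, and possibly per location, instead of using fixed scales.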
