
Streaming quanta sensors for online, high-performance imaging and vision

Tianyi Zhang, Matthew Dutson, Vivek Boominathan, Mohit Gupta, Ashok Veeraraghavan

TL;DR

The paper tackles the data-bandwidth and processing bottlenecks of ultra-fast SPAD-based quanta image sensors (QIS) by introducing a compact streaming representation that is updated with every binary frame and stores intensity information at multiple time scales. A feed-forward neural network reconstructs intensity frames from an 8-channel streaming exposure stack in real time (10–30 fps), yielding ~100× bandwidth reductions and a 10^4–10^5× reduction in computation relative to the existing state-of-the-art approach. The approach enables online, real-time image reconstruction on QIS and supports downstream vision tasks (detection, tracking, pose estimation), validated on synthetic and real data using a semi-realistic QIS dataset. By lowering data and compute requirements, it makes real-time QIS imaging and inference more feasible in resource-constrained settings. The paper also outlines limitations and avenues for end-to-end streaming architectures and alternative representations.

Abstract

Recently, quanta image sensors (QIS) -- ultra-fast, zero-read-noise binary image sensors -- have demonstrated remarkable imaging capabilities in many challenging scenarios. Despite their potential, the adoption of these sensors is severely hampered by (a) high data rates and (b) the need for new computational pipelines to handle the unconventional raw data. We introduce a simple, low-bandwidth computational pipeline to address these challenges. Our approach is based on a novel streaming representation with a small memory footprint that efficiently captures intensity information at multiple temporal scales. Updating the representation requires only 16 floating-point operations per pixel, which can be computed online at the native frame rate of the binary frames. We use a neural network operating on this representation to reconstruct videos in real time (10-30 fps). We illustrate why such a representation is well-suited for these emerging sensors, and how it offers low latency and a high frame rate while retaining flexibility for downstream computer vision. Our approach results in a significant data bandwidth reduction (~100X) and enables real-time image reconstruction and computer vision, with a $10^4$-$10^5\times$ reduction in computation compared with the existing state-of-the-art approach while maintaining comparable quality. To the best of our knowledge, our approach is the first to achieve online, real-time image reconstruction on QIS.
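To make the idea of a per-frame streaming update concrete, here is a minimal sketch in Python/NumPy. It assumes the multi-scale representation is a bank of exponential moving averages with log-spaced time constants; the abstract does not state the actual update rule, so the recurrence, the function names `make_decay_rates` and `update_stack`, and the chosen time constants are all hypothetical. Under this assumption each channel costs one subtraction and one multiply-add per pixel, so 8 channels would come to 16 operations per pixel if a fused multiply-add is counted as a single operation.

```python
import numpy as np

def make_decay_rates(num_channels=8, shortest=1.0, longest=2.0 ** 14):
    """Hypothetical time constants (in binary frames), log-spaced short to long."""
    taus = np.geomspace(shortest, longest, num_channels)
    return (1.0 / taus).astype(np.float32)

def update_stack(stack, binary_frame, alphas):
    """Fold one binary frame into the stack in place.

    stack: float32 array of shape (C, H, W), one running average per channel.
    binary_frame: array of shape (H, W) with values in {0, 1}.
    alphas: float32 array of shape (C,), one update weight per channel.
    """
    frame = binary_frame[None].astype(np.float32)
    # Assumed recurrence: s <- s + alpha * (b - s), i.e. an exponential moving average.
    stack += alphas[:, None, None] * (frame - stack)
    return stack

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alphas = make_decay_rates()
    stack = np.zeros((8, 4, 4), dtype=np.float32)
    for _ in range(2000):
        update_stack(stack, rng.random((4, 4)) < 0.3, alphas)
    print(stack.mean(axis=(1, 2)))  # per-channel estimate of the firing rate
```

Only the current stack is kept in memory, so the footprint is independent of how many binary frames have been seen, and the stack can be read out at any time without waiting for an exposure to finish.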

Paper Structure

This paper contains 30 sections, 6 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Top: We propose a novel online processing architecture for QIS, which consists of a streamed extreme multi-exposure stack and a modified U-Net with ResNet blocks that performs the reconstruction. The approach results in good image reconstruction performance. Bottom: The exposure stack is computed via a streaming update. Such a representation ensures that information up to the most recent time instant is always available and decouples exposure from frame readout. The sensor produces a consistent and continuously updated multi-exposure stack for any downstream request with near-zero lag.
  • Figure 2: The streamed multi-exposure set. (a) shows our streamed multi-exposure representation (top left: long exposure, bottom right: short exposure; read row by row), in a scene containing objects with very different motion speeds. (b) shows a zoomed-in view of the train and a pedestrian. The train moves at a blazingly fast speed, while the pedestrian moves at a slower but still significant speed. The shortest exposure provides contour information for localization and alignment (e.g., windows/doors of the train, contour of the pedestrian) but lacks detail due to significant noise. The longer exposures reveal more information about the motion trajectory/speed and provide more bit-depth for resolving intensities. (c) shows a zoomed-in view of the window, which has high flux and lower contrast. There is very mild motion due to camera motion. Long/medium exposures can be directly used to resolve details in such cases (e.g., the building in the far background).
  • Figure 3: Why are streaming representations suitable for quanta sensors? (dc = dark counts per exposure, p.l = photon level, in number of photons). Left column: Conventional cameras suffer from read noise that increases with higher readout rates. This obscures scene details under low photon levels. QIS technology, on the other hand, has no read noise at any readout rate, making it ideal for preserving details during short exposures and in streaming sensors. Here we assume $\text{QE} = 1$ for both sensors and a dark count rate of $\approx$ 10 cps. Middle column: Current QIS devices have relatively low dark counts, such that even in a longer exposure (more dark counts per exposure) with sparse photon counts, the features can still be visible in the captured frames. Right column: We examine frame delay and latency in different multi-exposure architectures. Here the darker bars represent shorter exposures. Conventional approaches cause lag and inefficiencies for downstream processing and require precise synchronization. (Row 1) Sequential exposure capturing leads to long capture times per exposure stack. (Row 2) The multi-bucket model captures multiple exposures at once, but the discretization issue persists. (Row 3) Computing multi-exposures in a streaming manner provides the latest information at any time, reducing downstream inefficiencies.
  • Figure 4: Dataset samples. (a) Real sequence interpolated from the 240 FPS NFS dataset. The interpolated sequence contains fine-grained, realistic motion. The original videos are interpolated 16x by applying a 4x interpolation twice to avoid artifacts. (b) Kubric samples. We simulate rigid motion of objects (from ShapeNet assets [dang2015raise]) in 3D space with precise control of motion. This allows us to model effects such as perspective change and self-occlusion. We program the foreground with Bezier/linear motion trajectories, add raw static background images [dang2015raise], and fuse them to form a video sequence. Motion sequences produced by (a) and (b) can be used to simulate binary frames.
  • Figure 5: Theoretical dynamic range estimation for bit width $N$. To resolve lower flux and higher flux at the same time (HDR), a larger number of binary measurements needs to be contained in the representation. Using our approach, we plot the forward response curve. We then find the left-most and right-most flux points where the SNR starts to fall below a certain value (the 20 dB points are shown here). $d_\text{left}$ and $d_\text{right}$ are the left/right flux estimation errors for the given flux. These values dictate the flux ranges to simulate for training data.
  • ...and 7 more figures
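
The captions above mention simulating binary frames from motion sequences (Figure 4) and a forward response curve (Figure 5). The sketch below illustrates both with the standard single-photon model, in which a pixel reports 1 when it registers at least one Poisson-distributed count during a binary exposure. This is an assumption about the simulator, not a description of the paper's pipeline. The function names and the exposure time are hypothetical; the defaults $\text{QE} = 1$ and 10 cps dark counts are taken from the Figure 3 caption.

```python
import numpy as np

def simulate_binary_frames(flux, num_frames, exposure_s,
                           qe=1.0, dark_cps=10.0, rng=None):
    """Sample 1-bit frames; a pixel fires if it registers at least one count.

    flux: array of photon flux per pixel, in photons per second.
    """
    rng = np.random.default_rng() if rng is None else rng
    expected = (qe * flux + dark_cps) * exposure_s  # mean counts per binary frame
    p_fire = 1.0 - np.exp(-expected)                # Poisson P(count >= 1)
    return rng.random((num_frames,) + flux.shape) < p_fire

def estimate_flux(binary_frames, exposure_s, qe=1.0, dark_cps=10.0):
    """Invert the response curve to recover flux from the mean firing rate."""
    n = binary_frames.shape[0]
    rate = binary_frames.mean(axis=0)
    rate = np.clip(rate, 0.0, 1.0 - 0.5 / n)        # avoid log(0) at saturation
    total_cps = -np.log1p(-rate) / exposure_s
    return np.maximum(total_cps - dark_cps, 0.0) / qe

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flux = np.full((8, 8), 5e3)                     # hypothetical flat scene
    frames = simulate_binary_frames(flux, 4000, 1e-4, rng=rng)
    print(estimate_flux(frames, 1e-4).mean())       # should be close to 5e3
```

The response saturates as the firing probability approaches 1, which is why the estimate is clipped before taking the logarithm; averaging more binary frames reduces the estimation error at both the low-flux and high-flux ends.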