Rethinking High-speed Image Reconstruction Framework with Spike Camera

Kang Chen, Yajing Zheng, Tiejun Huang, Zhaofei Yu

TL;DR

This work tackles spike-to-image reconstruction under challenging low-light conditions by introducing SpikeCLIP, a CLIP-guided framework that uses class labels and unpaired high-quality (HQ) images as supervision instead of ground-truth sharp frames. The method has three stages: a coarse reconstruction stage, a prompt learning stage that learns prompts to distinguish the HQ and low-quality (LQ) image distributions, and a fine reconstruction stage guided by prompt and class losses. Together these yield texture-rich, brightness-balanced reconstructions from sparse spike streams. Experiments on the real-world datasets U-CALTECH and U-CIFAR show substantial improvements over state-of-the-art methods in perceptual quality and downstream-task alignment, using a lightweight reconstruction network. The results suggest that cross-modal supervision is a practical route to robust neuromorphic imaging under extreme real-world conditions.

Abstract

Spike cameras, as innovative neuromorphic devices, generate continuous spike streams to capture high-speed scenes with lower bandwidth and higher dynamic range than traditional RGB cameras. However, reconstructing high-quality images from the spike input under low-light conditions remains challenging. Conventional learning-based methods often rely on synthetic datasets for supervision during training, but these approaches falter when dealing with noisy spikes fired in low-light environments, leading to further performance degradation on real-world datasets. This degradation is primarily due to inadequate noise modelling and the domain gap between synthetic and real datasets, resulting in recovered images with unclear textures, excessive noise, and diminished brightness. To address these challenges, we introduce SpikeCLIP, a novel spike-to-image reconstruction framework that goes beyond traditional training paradigms. Leveraging the CLIP model's powerful capability to align text and images, we incorporate the textual description of the captured scene and unpaired high-quality datasets as supervision. Our experiments on the real-world low-light datasets U-CALTECH and U-CIFAR demonstrate that SpikeCLIP significantly enhances the texture details and luminance balance of recovered images. Furthermore, the reconstructed images are well aligned with the broader visual features needed for downstream tasks, ensuring more robust and versatile performance in challenging environments.
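
To make the idea of using a textual description as supervision more concrete, the following is a minimal sketch of a CLIP-style class loss: a cross-entropy over cosine similarities between one image embedding and one text embedding per class. It is an illustration under stated assumptions, not the paper's implementation. The function names, the `temperature` value, and the random vectors standing in for CLIP image and text features are all hypothetical; the summary above does not specify the exact loss formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def clip_style_class_loss(image_emb, text_embs, label, temperature=0.01):
    """Cross-entropy over image-text cosine similarities (illustrative only).

    image_emb:   (d,) embedding of a reconstructed image (assumed given).
    text_embs:   (num_classes, d) one text embedding per class (assumed given).
    label:       index of the class describing the captured scene.
    temperature: assumed softmax temperature, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = txt @ img / temperature           # shape (num_classes,)
    logits = logits - logits.max()             # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

# Random vectors stand in for real CLIP features in this toy example.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(5, 16))
aligned = text_embs[2] + 0.05 * rng.normal(size=16)     # close to class 2
misaligned = text_embs[4] + 0.05 * rng.normal(size=16)  # close to class 4

loss_aligned = clip_style_class_loss(aligned, text_embs, label=2)
loss_misaligned = clip_style_class_loss(misaligned, text_embs, label=2)
print(loss_aligned < loss_misaligned)
```

In this toy setup, an embedding that lies near the text embedding of its labelled class receives a lower loss than one that lies near a different class, which is the kind of signal a reconstruction network could be trained against when no ground-truth sharp frame is available.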

Paper Structure (28 sections, 10 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Illustration of the advantages of our method. While previous learning-based approaches struggle with real-world data under extreme conditions, such as low-light scenarios, our proposed SpikeCLIP successfully reconstructs high-quality images.
  • Figure 2: The overall framework of our three-stage spike-based image reconstruction method.
  • Figure 3: The framework of our designed HQ images generation pipeline.
  • Figure 4: Prompt and class loss illustration.
  • Figure 5: Visual comparison of our method with previous methods on the U-CALTECH dataset.