Rethinking High-speed Image Reconstruction Framework with Spike Camera
Kang Chen, Yajing Zheng, Tiejun Huang, Zhaofei Yu
TL;DR
This work tackles spike-to-image reconstruction under challenging low-light conditions by introducing SpikeCLIP, a CLIP-guided framework that uses class labels and unpaired high-quality images as supervision instead of ground-truth sharp frames. The method combines three stages: a coarse reconstruction stage, a prompt learning stage that learns to distinguish high-quality (HQ) from low-quality (LQ) image distributions, and a fine reconstruction stage guided by prompt and class losses. Together these enable texture-rich, brightness-balanced reconstructions from sparse spike streams. Experiments on the real-world datasets U-CALTECH and U-CIFAR show substantial improvements over state-of-the-art methods in perceptual quality and downstream-task alignment, achieved with an efficient, lightweight reconstruction network. The approach demonstrates the practical value of cross-modal supervision for neuromorphic imaging and offers robust performance in extreme real-world conditions.
Abstract
Spike cameras are neuromorphic devices that generate continuous spike streams, capturing high-speed scenes with lower bandwidth and higher dynamic range than traditional RGB cameras. However, reconstructing high-quality images from spike input under low-light conditions remains challenging. Conventional learning-based methods typically rely on synthetic datasets for supervision. These approaches falter on the noisy spikes fired in low-light environments, and their performance degrades further on real-world data. This degradation is primarily due to inadequate noise modelling and the domain gap between synthetic and real datasets, which produce recovered images with unclear textures, excessive noise, and diminished brightness. To address these challenges, we introduce SpikeCLIP, a novel spike-to-image reconstruction framework that goes beyond traditional training paradigms. Leveraging the CLIP model's powerful capability to align text and images, we use textual descriptions of the captured scene and unpaired high-quality datasets as supervision. Our experiments on the real-world low-light datasets U-CALTECH and U-CIFAR demonstrate that SpikeCLIP significantly enhances the texture details and luminance balance of recovered images. Furthermore, the reconstructed images are well aligned with the broader visual features needed for downstream tasks, ensuring more robust and versatile performance in challenging environments.
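
To make the idea of text-based supervision concrete, the following is a minimal sketch of a CLIP-style class loss: a cross-entropy over cosine similarities between the embedding of a reconstructed image and the embeddings of candidate class descriptions. It is not the authors' implementation. The function names, the temperature value of 0.07, and the one-hot vectors standing in for real CLIP features are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_loss(image_emb, text_embs, label, temperature=0.07):
    """Cross-entropy between the image embedding and the class text embeddings.

    image_emb: (d,) embedding of the reconstructed image (hypothetical encoder output)
    text_embs: (num_classes, d) embeddings of class descriptions
    label:     index of the class that describes the captured scene
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = txt @ img / temperature          # scaled cosine similarities
    logits = logits - logits.max()            # stabilise the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[label])

# Toy placeholders: orthogonal one-hot vectors stand in for real text features.
text_embs = np.eye(5, 16)
aligned = text_embs[2].copy()     # image embedding that matches class 2
unrelated = text_embs[0].copy()   # image embedding that matches class 0 instead

loss_aligned = class_loss(aligned, text_embs, label=2)
loss_unrelated = class_loss(unrelated, text_embs, label=2)
print(loss_aligned, loss_unrelated)
```

The loss is small when the image embedding agrees with the text embedding of the labelled class and large otherwise, so minimising it pushes a reconstruction toward the semantics of its class label without requiring a ground-truth sharp frame.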
