SpikeGen: Decoupled "Rods and Cones" Visual Representation Processing with Latent Generative Framework
Gaole Dai, Menghang Dong, Rongyu Zhang, Ruichuan An, Shanghang Zhang, Tiejun Huang
TL;DR
SpikeGen addresses the challenge of jointly processing decoupled visual inputs from RGB cameras and spike-based sensors by running diffusion in a learned latent space. It introduces a configurable dual-modality pre-training pipeline that pairs RGB latent representations from a VAE with spike latent representations from a Spatial-Temporal Separable encoder, and applies per-token latent diffusion conditioned on complete tokens from both sources. Random modality dropout and a spike-alignment strategy enable robust downstream adaptation to deblurring, dense frame reconstruction, and high-speed novel-view synthesis, where the model outperforms state-of-the-art baselines across multiple benchmarks. Generating in latent space reduces computational overhead, mitigates the sharpness trap of pixel-space methods, and provides a foundation for neuromorphic-vision systems that fuse perception and generative inference in dynamic environments.
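As a rough illustration of the random modality dropout mentioned above, the following is a minimal sketch and not the authors' implementation. The function name `random_modality_dropout`, the `p_drop` probability, the use of all-zero placeholder tokens, and the rule that at most one modality is dropped per call are all assumptions made for this example; the source does not specify these details.

```python
import random


def random_modality_dropout(rgb_tokens, spike_tokens, p_drop=0.3, rng=random):
    """Blank out the conditioning tokens of at most one modality.

    Hypothetical sketch: tokens are lists of float vectors, and a dropped
    modality is replaced by all-zero tokens of the same shape. Dropping at
    most one modality (so some conditioning always remains) is an
    assumption of this example, not a detail taken from the paper.
    """
    def zeros_like(tokens):
        # Build zero-filled placeholders with the same token count and width.
        return [[0.0] * len(tok) for tok in tokens]

    if rng.random() < p_drop:
        # Choose uniformly which single modality to drop.
        if rng.random() < 0.5:
            return zeros_like(rgb_tokens), spike_tokens, "rgb"
        return rgb_tokens, zeros_like(spike_tokens), "spike"

    # No dropout this step: both modalities condition the model.
    return rgb_tokens, spike_tokens, None


if __name__ == "__main__":
    rgb = [[0.1, 0.2], [0.3, 0.4]]   # two toy RGB latent tokens
    spike = [[1.0, 0.0]]             # one toy spike latent token
    print(random_modality_dropout(rgb, spike, p_drop=0.5, rng=random.Random(0)))
```

Training with inputs perturbed this way is one plausible route to a model that can later be adapted to tasks where only one modality is available or reliable, such as reconstruction from spike streams alone.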
Abstract
The process through which humans perceive and learn visual representations in dynamic environments is highly complex. From a structural perspective, the human eye decouples the functions of cone and rod cells: cones are primarily responsible for color perception, while rods specialize in detecting motion, particularly variations in light intensity. These two distinct modalities of visual information are integrated and processed within the visual cortex, which enhances the robustness of the human visual system. Inspired by this biological mechanism, modern hardware systems have evolved to include not only color-sensitive RGB cameras but also motion-sensitive Dynamic Visual Systems, such as spike cameras. Building upon these advancements, this study seeks to emulate the human visual system by integrating decomposed multi-modal visual inputs with modern latent-space generative frameworks. We name the resulting framework SpikeGen. We evaluate its performance across various spike-RGB tasks, including conditional image and video deblurring, dense frame reconstruction from spike streams, and novel-view synthesis of high-speed scenes. Through extensive experiments, we demonstrate that the latent-space manipulation capabilities of generative models allow the two visual modalities to enhance each other, addressing the spatial sparsity of spike inputs and the temporal sparsity of RGB inputs.
