SpikeGen: Decoupled "Rods and Cones" Visual Representation Processing with Latent Generative Framework

Gaole Dai, Menghang Dong, Rongyu Zhang, Ruichuan An, Shanghang Zhang, Tiejun Huang

TL;DR

SpikeGen addresses the challenge of jointly processing decoupled visual inputs from RGB cameras and spike-based sensors by learning in a latent space with diffusion. It introduces a configurable dual-modality pre-training pipeline, combining RGB latent representations via a VAE with spike latent representations via a Spatial-Temporal Separable encoder, and uses per-token latent diffusion conditioned on complete tokens from both sources. A random modality dropout and a spike-alignment strategy enable robust downstream adaptation for deblurring, dense frame reconstruction, and high-speed novel-view synthesis, outperforming state-of-the-art baselines across multiple benchmarks. This latent-generation approach reduces computational overhead, mitigates the sharpness trap of pixel-space methods, and provides a foundation for neuromorphic-vision systems that fuse perception and generative inference in dynamic environments.

Abstract

The process through which humans perceive and learn visual representations in dynamic environments is highly complex. From a structural perspective, the human eye decouples the functions of cone and rod cells: cones are primarily responsible for color perception, while rods are specialized in detecting motion, particularly variations in light intensity. These two distinct modalities of visual information are integrated and processed within the visual cortex, thereby enhancing the robustness of the human visual system. Inspired by this biological mechanism, modern hardware systems have evolved to include not only color-sensitive RGB cameras but also motion-sensitive Dynamic Visual Systems, such as spike cameras. Building upon these advancements, this study seeks to emulate the human visual system by integrating decomposed multi-modal visual inputs with modern latent-space generative frameworks. We name the resulting framework SpikeGen. We evaluate its performance across various spike-RGB tasks, including conditional image and video deblurring, dense frame reconstruction from spike streams, and high-speed scene novel-view synthesis. Through extensive experiments, we demonstrate that leveraging the latent-space manipulation capabilities of generative models enables an effective synergistic enhancement of different visual modalities, addressing spatial sparsity in spike inputs and temporal sparsity in RGB inputs.

Paper Structure

This paper contains 46 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of the motivation, task setups and main quantitative results of SpikeGen.
  • Figure 2: The Overall Pipeline of SpikeGen. SpikeGen adopts a standard pre-training (self-supervised) and fine-tuning (task-dependent) pipeline. Specifically, visual information from two modalities is encoded, followed by the addition of the two latent representations. In this process, $\gamma$ serves as a parameter to control the effective weight and is randomly sampled from the interval [0, 1]. Note that during the pre-training phase, the diffusion loss is computed using the pre-extracted latent representation of the clear RGB image obtained via a Variational Autoencoder (VAE) (Kingma and Welling, 2013). The spike stream loss during the fine-tuning phase is calculated based on the given ground truth and the synthetic spike stream generated from the predicted RGB output.
  • Figure 3: Conditional Image Deblurring on Synthetic RGB-Spike Data. For all our experiments, the input visual spike streams for SpikeGen are in binary format (0/1) without colour information. Here, for better visualization, row 2 (from top) of the left panel demonstrates the cut-out result of the RGB channels using 3 spike frames (results in row 4 used 8 frames as input). We also magnified a few results with highlighted detail for better comparison of the structural correctness (right panel).
  • Figure 4: Mutual Guidance with Dual Modality Inputs. The top panel presents the results obtained from both RGB-based deblurring methods and spike-RGB-based approaches. SpikeGen demonstrated superior performance compared to all competitors in terms of visual fidelity. The bottom panel illustrates the outcomes of various methods when only a limited number of spike frames (here, 16 frames) are available. SpikeGen addresses spatial ambiguity caused by spike sparsity by leveraging the merged result of spike frames (i.e., TFP) as a pseudo-dense modality.
  • Figure 5: Qualitative Results of Novel View Synthesis Task.
  • ...and 4 more figures
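
The latent addition described in the Figure 2 caption and the TFP merge mentioned in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The function names `fuse_latents` and `tfp_merge`, the flat-list latent layout, and the choice to scale the spike latent (rather than the RGB latent) by $\gamma$ are all assumptions; the captions state only that the two latents are added, that $\gamma$ is sampled from [0, 1], and that TFP is the merged result of spike frames. TFP is taken here to mean a per-pixel average of the binary frames over the window.

```python
import random


def fuse_latents(z_rgb, z_spike, gamma=None, rng=random):
    """Add two equally sized latent vectors, scaling the spike latent by gamma.

    ASSUMPTION: the caption does not say which term gamma scales; weighting
    the spike latent is an illustrative choice only.
    """
    if len(z_rgb) != len(z_spike):
        raise ValueError("latents must have the same length")
    if gamma is None:
        # Sampled uniformly from [0, 1], matching the interval in the caption.
        gamma = rng.uniform(0.0, 1.0)
    fused = [r + gamma * s for r, s in zip(z_rgb, z_spike)]
    return fused, gamma


def tfp_merge(spike_frames):
    """Average T binary spike frames per pixel.

    Each frame is an H x W nested list of 0/1 values. The result is an
    H x W nested list of firing rates in [0, 1], usable as a pseudo-dense image.
    ASSUMPTION: TFP is modelled as a plain temporal mean over the window.
    """
    if not spike_frames:
        raise ValueError("need at least one spike frame")
    t = len(spike_frames)
    height, width = len(spike_frames[0]), len(spike_frames[0][0])
    return [
        [sum(frame[y][x] for frame in spike_frames) / t for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    fused, gamma = fuse_latents([1.0, 2.0], [4.0, 8.0], gamma=0.5)
    print(fused, gamma)

    # Four 1x2 binary frames merged into one pseudo-dense 1x2 frame.
    frames = [[[1, 0]], [[1, 1]], [[0, 0]], [[1, 0]]]
    print(tfp_merge(frames))
```

Under these assumptions, a pixel that fires in more frames of the window maps to a brighter value, which is how a sparse binary stream can stand in for a dense modality when only a few spike frames are available.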