SYNAPSE: Synergizing an Adapter and Finetuning for High-Fidelity EEG Synthesis from a CLIP-Aligned Encoder
Jeyoung Lee, Hochul Kang
TL;DR
This work tackles EEG-to-image synthesis using diffusion models by addressing EEG-specific challenges such as noise and subject variability. It introduces SYNAPSE, a two-stage framework that first learns a CLIP-aligned semantic latent from EEG via a hybrid autoencoder and then finetunes a lightweight adapter on Stable Diffusion to condition image generation on EEG features with minimal trainable parameters. The approach achieves state-of-the-art perceptual fidelity on the CVPR40 benchmark and demonstrates strong cross-subject generalization, while revealing that reconstructing perceptual content rather than strict class labels yields more faithful generations. The results indicate that the learned latent geometry aligns with visual structure, enabling plausible reconstructions of what subjects perceive and offering a practical pathway toward brain-to-vision translation and potential clinical applications.
Abstract
Recent progress in diffusion-based generative models has enabled high-quality image synthesis conditioned on diverse modalities. Extending such models to brain signals could deepen our understanding of human perception and mental representations. However, electroencephalography (EEG) presents major challenges for image generation due to high noise, low spatial resolution, and strong inter-subject variability. Existing approaches, such as DreamDiffusion, BrainVis, and GWIT, primarily adapt EEG features to pre-trained Stable Diffusion models using complex alignment or classification pipelines, often resulting in large parameter counts and limited interpretability. We introduce SYNAPSE, a two-stage framework that bridges EEG signal representation learning and high-fidelity image synthesis. In Stage 1, a CLIP-aligned EEG autoencoder learns a semantically structured latent representation by combining signal reconstruction and cross-modal alignment objectives. In Stage 2, the pretrained encoder is frozen and integrated with a lightweight adaptation of Stable Diffusion, enabling efficient conditioning on EEG features with minimal trainable parameters. Our method achieves a semantically coherent latent space and state-of-the-art perceptual fidelity on the CVPR40 dataset, outperforming prior EEG-to-image models in both reconstruction efficiency and image quality. Quantitative and qualitative analyses demonstrate that SYNAPSE generalizes effectively across subjects, preserving visual semantics even when class-level agreement is reduced. These results suggest that reconstructing what the brain perceives, rather than what it classifies, is key to faithful EEG-based image generation.
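
As an illustration of how the Stage 1 objective might combine signal reconstruction with cross-modal alignment, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the loss forms, so mean squared error for reconstruction, cosine distance for CLIP alignment, and the weighting parameter `align_weight` are all assumptions, as are the function names.

```python
import math

def mse(x, x_hat):
    """Mean squared error between a signal and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def cosine_distance(u, v):
    """1 - cosine similarity; 0 when u and v point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def stage1_loss(eeg, eeg_recon, eeg_latent, clip_embedding, align_weight=1.0):
    """Hypothetical combined objective: reconstruction + weighted alignment.

    eeg, eeg_recon: flattened EEG signal and the autoencoder's reconstruction.
    eeg_latent, clip_embedding: EEG latent and the CLIP embedding of the
    image the subject viewed (assumed to share a dimensionality).
    """
    reconstruction = mse(eeg, eeg_recon)
    alignment = cosine_distance(eeg_latent, clip_embedding)
    return reconstruction + align_weight * alignment
```

Under these assumptions, the loss is zero only when the EEG signal is reconstructed exactly and the EEG latent points in the same direction as the CLIP embedding; `align_weight` trades off the two terms.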
