SYNAPSE: Synergizing an Adapter and Finetuning for High-Fidelity EEG Synthesis from a CLIP-Aligned Encoder

Jeyoung Lee, Hochul Kang

TL;DR

This work tackles EEG-to-image synthesis with diffusion models by addressing EEG-specific challenges such as noise and subject variability. It introduces SYNAPSE, a two-stage framework that first learns a CLIP-aligned semantic latent from EEG via a hybrid autoencoder, then finetunes a lightweight adapter on Stable Diffusion so that image generation is conditioned on EEG features with minimal trainable parameters. The approach achieves state-of-the-art perceptual fidelity on the CVPR40 benchmark and generalizes well across subjects, and it shows that reconstructing perceptual content rather than strict class labels yields more faithful generations. The results indicate that the learned latent geometry aligns with visual structure, enabling plausible reconstructions of what subjects perceive and offering a practical pathway toward brain-to-vision translation and potential clinical applications.

Abstract

Recent progress in diffusion-based generative models has enabled high-quality image synthesis conditioned on diverse modalities. Extending such models to brain signals could deepen our understanding of human perception and mental representations. However, electroencephalography (EEG) presents major challenges for image generation due to high noise, low spatial resolution, and strong inter-subject variability. Existing approaches, such as DreamDiffusion, BrainVis, and GWIT, primarily adapt EEG features to pre-trained Stable Diffusion models using complex alignment or classification pipelines, often resulting in large parameter counts and limited interpretability. We introduce SYNAPSE, a two-stage framework that bridges EEG signal representation learning and high-fidelity image synthesis. In Stage 1, a CLIP-aligned EEG autoencoder learns a semantically structured latent representation by combining signal reconstruction and cross-modal alignment objectives. In Stage 2, the pretrained encoder is frozen and integrated with a lightweight adaptation of Stable Diffusion, enabling efficient conditioning on EEG features with minimal trainable parameters. Our method achieves a semantically coherent latent space and state-of-the-art perceptual fidelity on the CVPR40 dataset, outperforming prior EEG-to-image models in both reconstruction efficiency and image quality. Quantitative and qualitative analyses demonstrate that SYNAPSE generalizes effectively across subjects, preserving visual semantics even when class-level agreement is reduced. These results suggest that reconstructing what the brain perceives, rather than what it classifies, is key to faithful EEG-based image generation.

Paper Structure

This paper contains 35 sections, 7 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Qualitative comparison of SYNAPSE with state-of-the-art EEG-to-image generation models across representative classes. Each row displays generated samples for 'Airliner', 'Jack O'Lantern', and 'Panda' from a different model (DreamDiffusion, BrainVis, GWIT, and Ours). While prior methods show limitations in structural coherence (e.g., 'Airliner') or semantic fidelity (e.g., 'Panda'), our model (Ours) consistently generates images with higher perceptual quality and closer adherence to the visual features of the original stimuli. This visually supports the state-of-the-art FID score reported in the main results table.
  • Figure 2: Overview of SYNAPSE, consisting of two stages. (Left) Stage 1: Pre-training the hybrid EEG Autoencoder. The encoder $f_{\text{enc}}$ learns to produce a CLIP-aligned latent vector $Z_{\text{latent}}$ by minimizing (1) a reconstruction loss ($L_{\text{recon}}$) for signal fidelity and (2) an alignment loss ($L_{\text{align}} + L_{\text{Contrastive}}$) for semantic consistency with CLIP image–text embeddings. (Right) Stage 2: Finetuning Stable Diffusion. The pre-trained encoder $f_{\text{enc}}$ is frozen, while a lightweight EEG Adaptation module $f_{\text{adapt}}$ and selected cross-attention layers of the SD U-Net are finetuned to synthesize images conditioned on both $Z_{\text{latent}}$ and $Z_{\text{adapt}}$.
  • Figure 3: Architecture of the hybrid EEG Autoencoder. It employs a U-Net backbone that processes temporal signals through residual blocks and spatial dependencies through transformer blocks to output a CLIP-aligned latent representation.
  • Figure 4: Subject 4 t-SNE example
  • Figure 5: Multi-subject t-SNE example
  • ...and 8 more figures
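
The Figure 2 caption describes the Stage 1 objective as a reconstruction term combined with alignment and contrastive terms. As a rough illustration only, the NumPy sketch below shows one way such a combined loss could be computed for a batch. The specific loss forms (mean squared error, cosine distance, and an InfoNCE-style contrastive term), the weights `w_align` and `w_contrast`, the temperature value, and the function name `stage1_loss` are all assumptions made for this example; the text above names the terms but does not specify them.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def stage1_loss(eeg, eeg_recon, z_latent, clip_emb,
                w_align=1.0, w_contrast=1.0, temperature=0.07):
    """Hypothetical combined Stage 1 objective (illustrative only).

    eeg, eeg_recon : (batch, features) input signal and its reconstruction
    z_latent       : (batch, dim) EEG latent vectors
    clip_emb       : (batch, dim) CLIP embeddings of the paired stimuli
    """
    # Assumed reconstruction term: mean squared error on the signal.
    l_recon = np.mean((eeg - eeg_recon) ** 2)

    z = l2_normalize(z_latent)
    c = l2_normalize(clip_emb)

    # Assumed alignment term: cosine distance between matched pairs.
    l_align = np.mean(1.0 - np.sum(z * c, axis=-1))

    # Assumed contrastive term: InfoNCE in the EEG -> CLIP direction,
    # treating the other items in the batch as negatives.
    logits = z @ c.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_contrast = -np.mean(np.diag(log_prob))

    total = l_recon + w_align * l_align + w_contrast * l_contrast
    parts = {
        "recon": float(l_recon),
        "align": float(l_align),
        "contrastive": float(l_contrast),
    }
    return float(total), parts
```

In this sketch, a latent that matches its paired CLIP embedding drives the alignment term toward zero and keeps the contrastive term small, while pairing each latent with the wrong embedding raises both terms.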