Towards Interpretable Visual Decoding with Attention to Brain Representations

Pinyuan Feng, Hossein Adeli, Wenxuan Guo, Fan Cheng, Ethan Hwang, Nikolaus Kriegeskorte

TL;DR

This work introduces an Image-Brain BI-directional interpretability framework (IBBI) that analyzes cross-attention patterns across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory and highlights the potential of end-to-end brain-to-image reconstruction.

Abstract

Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, offering new ways to probe how the brain represents real-world scenes. However, many existing approaches first map brain signals into intermediate image or text feature spaces before guiding the generative process, which obscures the contributions of different brain areas to the final reconstruction output. In this work, we propose NeuroAdapter, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals drive visual reconstruction. To this end, we introduce an Image-Brain BI-directional interpretability framework (IBBI) that analyzes cross-attention patterns across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our work highlights the potential of end-to-end brain-to-image reconstruction and establishes a path for interpretable neural decoding.
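As a rough illustration of the kind of analysis the abstract describes, the sketch below computes scaled dot-product cross-attention from image-latent queries to brain-token keys, then averages the attention each brain token receives over all queries and denoising steps. It is a toy under stated assumptions, not the authors' IBBI implementation: the function names, the two-dimensional vectors, and the choice of a plain average over steps are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_weights(queries, keys):
    """Return one row of attention weights per query; each row sums to 1.

    queries: image-latent query vectors (hypothetical stand-ins).
    keys:    brain-token key vectors, e.g. one per cortical parcel (assumption).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights

def token_influence(queries_per_step, keys):
    """Average attention received by each brain token over queries and steps."""
    totals = [0.0] * len(keys)
    n_rows = 0
    for queries in queries_per_step:  # one entry per denoising step
        for row in cross_attention_weights(queries, keys):
            for j, w in enumerate(row):
                totals[j] += w
            n_rows += 1
    return [t / n_rows for t in totals]

# Toy data: three brain tokens, two denoising steps.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries_per_step = [
    [[2.0, 0.0], [0.0, 2.0]],  # step 1: two image-latent positions
    [[1.0, 1.0]],              # step 2: one image-latent position
]
influence = token_influence(queries_per_step, keys)
```

Because every attention row sums to 1, the averaged scores also sum to 1 and can be read as a share of attention per brain token. In this toy input the third token receives the largest share, since it aligns with every query.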

Paper Structure

This paper contains 48 sections, 6 equations, 20 figures, 14 tables.

Figures (20)

  • Figure 1: Overview. Left: Typical two-stage decoding pipelines first map brain activity to intermediate feature spaces (e.g., CLIP/DINO) and then use those embeddings to guide a generative model. Right: Our end-to-end approach conditions a latent diffusion model directly on brain activity, enabling interpretations of the generative dynamics in both image and brain spaces.
  • Figure 2: NeuroAdapter training pipeline. (a) fMRI data collection paradigm, (b) cortical parcellation, (c) parcel-wise linear mapping from vertices to brain representation tokens, and (d) conditioning a latent diffusion model on these tokens for reconstruction.
  • Figure 3: Brain Encoder. (a) Brain encoder–based image selection using Pearson correlations between predicted and measured fMRI responses for an NSD test example. (b) Red: correlation between the brain activity predicted from the decoded images and the measured brain activity. Blue: correlation between the activity predicted for the test-set stimulus and the corresponding fMRI response.
  • Figure 4: Ground truth with decoded stimuli from NeuroAdapter across 4 subjects.
  • Figure 5: Model Comparison. Decoding performance across eight image quality metrics, comparing prior approaches and our method. To ensure a fair comparison, results are shown as relative improvements over a subject-specific ImageNet-retrieval baseline. (a) NeuroAdapter achieves performance competitive with embedding-aligned approaches, particularly on high-level semantic metrics. (b) Comparison with Brain Diffuser variants shows that their advantage on low-level metrics arises from a dedicated pathway for predicting latent visual features (VDVAE), whereas removing this pathway yields performance on low-level metrics comparable to ours.
  • ...and 15 more figures
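
The image selection step named in the Figure 3(a) caption can be sketched as follows: score each candidate reconstruction by the Pearson correlation between its encoder-predicted response and the measured fMRI response, then keep the best-scoring candidate. This is a minimal sketch under assumptions. The function names, the four-value response vectors, and the handling of constant vectors are hypothetical, and the brain encoder itself is represented only by its precomputed predictions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant vectors as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def select_best_candidate(predicted_responses, measured):
    """Return (index of best candidate, list of per-candidate correlations)."""
    scores = [pearson_r(p, measured) for p in predicted_responses]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy data: one measured response and three candidate predictions.
measured = [0.1, 0.5, 0.9, 0.3]
predicted = [
    [0.9, 0.5, 0.1, 0.7],  # mirrors the measured pattern (negative correlation)
    [0.2, 0.6, 1.0, 0.4],  # measured pattern shifted by a constant
    [0.3, 0.3, 0.4, 0.3],  # weakly related pattern
]
best_index, scores = select_best_candidate(predicted, measured)
```

Pearson correlation ignores constant offsets and positive rescaling, so the second candidate scores highest here even though its values differ from the measured ones.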