Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance

Jaehoon Joo, Taejin Jeong, Seongjae Hwang

TL;DR

Inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, this framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings and provides accurate multi-modal guidance to LDMs.

Abstract

Understanding how humans process visual information is a crucial step toward unraveling the underlying mechanisms of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task: given fMRI data evoked by visual stimuli, it aims to reconstruct the corresponding stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, incorporating additional semantic knowledge, beyond purely visual information, becomes imperative. In light of this, we exploit the way modern LDMs incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generation. Specifically, inspired by the two-streams hypothesis, which suggests that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real dataset that pairs natural image stimuli with fMRI data.

Paper Structure (9 sections, 4 figures, 3 tables)

This paper contains 9 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of results demonstrating the impact of textual guidance on visual stimuli reconstruction. In each triplet, the left column displays the original visual stimuli, while the middle and right columns present the reconstructed images without and with textual guidance, respectively. The predicted caption is generated using fMRI data. Notably, textual guidance enhances the capture of accurate semantic details, such as glasses and the shape of a bird.
  • Figure 2: Illustration of the training pipeline. The high-level stream maps the fMRI data to BERT's latent vectors to produce textual guidance. At the mid level, we map the fMRI data to the distribution of CLIP image embeddings, which serve as rough semantic visual guidance. At the low level, training aligns the fMRI data with SD's latent vectors to provide perceptual guidance.
  • Figure 3: Illustration of the inference pipeline. At the high level, $h_{pred}$ is predicted from the ventral region and decoded with GPT-2; Llama-2 then refines the decoded text into one cohesive sentence. At the mid level, image embeddings, which serve as visual semantic guidance for VD, are predicted from the nsdgeneral region. At the low level, $l_{pred}$, predicted from the early visual cortex, is decoded with the SD decoder to generate an image layout. This image layout, along with the refined caption and the predicted image embedding, is then fed into VD using an img2img approach to produce the final reconstructed image.
  • Figure 4: Qualitative comparison of reconstructed images with other methods. The first two rows show the original visual stimuli and the corresponding ground-truth captions.
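
The three-stream structure described in the captions for Figures 2 and 3 can be illustrated with a small sketch: each stream reads its own brain region and is fitted to its own embedding target. Everything concrete here is an assumption made for illustration, not the authors' implementation. Closed-form ridge regression stands in for whatever mapping model the paper actually trains, the voxel index ranges in `roi_masks` are hypothetical, the data are synthetic, and the embedding sizes are toy values. The downstream decoding steps (GPT-2, Llama-2, the SD decoder, and VD) are omitted.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def fit_streams(fmri, roi_masks, targets, alpha=1.0):
    """Fit one linear map per stream, from that stream's voxels to its target."""
    return {
        name: fit_ridge(fmri[:, mask], targets[name], alpha)
        for name, mask in roi_masks.items()
    }


def predict_streams(fmri, roi_masks, weights):
    """Predict each stream's embedding from its own region of interest."""
    return {name: fmri[:, roi_masks[name]] @ w for name, w in weights.items()}


# Synthetic stand-in for fMRI responses: 200 trials x 60 voxels.
rng = np.random.default_rng(0)
n_trials, n_voxels = 200, 60
fmri = rng.standard_normal((n_trials, n_voxels))

# Hypothetical voxel indices for the three regions named in Figure 3.
roi_masks = {
    "high": np.arange(0, 20),   # stands in for the ventral region
    "mid": np.arange(0, 60),    # stands in for nsdgeneral
    "low": np.arange(40, 60),   # stands in for the early visual cortex
}

# Toy embedding sizes; real text, CLIP, and SD latents are far larger.
target_dims = {"high": 8, "mid": 6, "low": 4}

# Synthetic targets: a random linear function of each region plus small noise.
targets = {}
for name, dim in target_dims.items():
    n_roi = len(roi_masks[name])
    true_w = rng.standard_normal((n_roi, dim))
    noise = 0.01 * rng.standard_normal((n_trials, dim))
    targets[name] = fmri[:, roi_masks[name]] @ true_w + noise

weights = fit_streams(fmri, roi_masks, targets, alpha=1e-3)
preds = predict_streams(fmri, roi_masks, weights)

for name, pred in preds.items():
    mse = float(np.mean((pred - targets[name]) ** 2))
    print(f"{name}: prediction shape {pred.shape}, train MSE {mse:.2e}")
```

Keeping one regressor per region mirrors the idea that each level of guidance should come from the brain area assumed to carry that kind of information, rather than from a single shared mapping.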