Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer

Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, Michal Irani

TL;DR

Brain-IT reconstructs seen images from fMRI using a Brain Interaction Transformer (BIT) that operates on functionally defined voxel clusters shared across subjects. A Voxel-to-Cluster (V2C) mapping assigns each voxel to one of 128 shared functional clusters, and BIT lets the resulting Brain Tokens interact to predict localized semantic and low-level image features. These features drive a dual-branch reconstruction: a semantic branch, in which a diffusion model is conditioned on adapted CLIP tokens, and a low-level branch, in which Deep Image Prior (DIP) inverts the predicted VGG features into a coarse image. The coarse image initializes the diffusion process with structural priors, and the semantic features then guide it toward the correct content. The resulting reconstructions surpass current state-of-the-art methods both visually and on standard objective metrics. Transfer to new subjects is data-efficient: meaningful reconstructions are obtained from as little as 15 minutes of fMRI data, and 1 hour of data gives results comparable to current methods trained on the full 40 hours. This suggests that the cross-subject, cluster-based design generalizes across individuals and could extend to other brain-imaging tasks.
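The idea of mapping every subject's voxels onto the same 128 functional clusters can be illustrated with a minimal sketch. Everything beyond the cluster count is an assumption made for illustration: the per-voxel embeddings, the nearest-centroid assignment, and the mean pooling are hypothetical stand-ins, not the paper's V2C procedure or BIT's learned aggregation.

```python
import numpy as np

N_CLUSTERS = 128  # number of shared functional clusters given in the text


def assign_voxels(voxel_emb, centroids):
    """Assign each voxel to its nearest centroid (an assumed stand-in for V2C)."""
    # Pairwise squared distances, shape (n_voxels, n_clusters).
    d2 = ((voxel_emb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def pool_by_cluster(activations, assignment, n_clusters=N_CLUSTERS):
    """Mean voxel activation per cluster; clusters with no voxels stay at 0."""
    sums = np.bincount(assignment, weights=activations, minlength=n_clusters)
    counts = np.bincount(assignment, minlength=n_clusters)
    return sums / np.maximum(counts, 1)


rng = np.random.default_rng(0)
centroids = rng.normal(size=(N_CLUSTERS, 8))  # hypothetical shared cluster centers

# Two hypothetical subjects with different voxel counts.
emb_a, emb_b = rng.normal(size=(500, 8)), rng.normal(size=(420, 8))
act_a, act_b = rng.normal(size=500), rng.normal(size=420)

tokens_a = pool_by_cluster(act_a, assign_voxels(emb_a, centroids))
tokens_b = pool_by_cluster(act_b, assign_voxels(emb_b, centroids))
print(tokens_a.shape, tokens_b.shape)
```

Both subjects end up with a representation of the same length even though their voxel counts differ, which is what allows downstream components to be shared across subjects.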

Abstract

Reconstructing images seen by people from their fMRI brain recordings provides a non-invasive window into the human brain. Despite recent progress enabled by diffusion models, current methods often lack faithfulness to the actual seen images. We present "Brain-IT", a brain-inspired approach that addresses this challenge through a Brain Interaction Transformer (BIT), allowing effective interactions between clusters of functionally similar brain voxels. These functional clusters are shared by all subjects, serving as building blocks for integrating information both within and across brains. All model components are shared by all clusters and subjects, allowing efficient training with a limited amount of data. To guide the image reconstruction, BIT predicts two complementary localized patch-level image features: (i) high-level semantic features, which steer the diffusion model toward the correct semantic content of the image; and (ii) low-level structural features, which help to initialize the diffusion process with the correct coarse layout of the image. BIT's design enables direct flow of information from brain-voxel clusters to localized image features. Through these principles, our method produces reconstructions from fMRI that are faithful to the seen images and surpass current SotA approaches both visually and on standard objective metrics. Moreover, with only 1 hour of fMRI data from a new subject, we achieve results comparable to current methods trained on full 40-hour recordings.
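One common way to initialize a diffusion process from a coarse image is to noise that image to an intermediate timestep with the standard DDPM forward formula and start denoising from there. The sketch below shows only that step, under loudly stated assumptions: the linear beta schedule, the starting timestep, and the pixel-space arrays are illustrative choices, and the source does not confirm that Brain-IT initializes its diffusion model in exactly this way.

```python
import numpy as np


def linear_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)


def noisy_init_from_coarse(coarse, t, alpha_bar, rng):
    """DDPM forward step: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(coarse.shape)
    a_t = alpha_bar[t]
    return np.sqrt(a_t) * coarse + np.sqrt(1.0 - a_t) * eps


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()

# Hypothetical coarse layout (3 channels, 8x8), standing in for the low-level output.
coarse = rng.uniform(0.0, 1.0, size=(3, 8, 8))

# Start denoising from an intermediate step rather than from pure noise.
t_start = 600  # assumed value; larger t keeps less of the coarse layout
x_start = noisy_init_from_coarse(coarse, t_start, alpha_bar, rng)
print(x_start.shape, float(alpha_bar[t_start]))
```

The choice of starting timestep trades off how much coarse structure survives against how much freedom the semantic conditioning has to change the image.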

Paper Structure

This paper contains 54 sections, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Reconstruction of seen images from fMRI using "Brain-IT". (a) Image reconstructions using the full NSD dataset (40 hours per subject). (b) Efficient transfer learning to new subjects with very little data: meaningful reconstructions are obtained with only 15 minutes of fMRI recordings. (Results on Subject 1.)
  • Figure 2: Overview of Brain-IT. (a) The Brain Interaction Transformer (BIT) transforms fMRI signals into semantic and VGG features using the Voxel-to-Cluster (V2C) mapping. Two branches are applied: (i) the low-level branch reconstructs a coarse image from VGG features, which is used to initialize (ii) the semantic branch, which uses semantic features to guide the diffusion model. (b) Voxel-to-Cluster mapping (V2C): each voxel from every subject is mapped to a functional cluster shared across subjects. (c) Low-level branch: the predicted VGG features are inverted using Deep Image Prior (DIP) to reconstruct a coarse image layout.
  • Figure 3: Comparing methods on 40-hour data (for Subject 1). Brain-IT is compared to 3 leading methods, yielding reconstructions that better preserve both semantic content and low-level visual properties. Brain-IT better reconstructs the correct objects with relevant structural details (e.g., orientation, color), providing reconstructions more faithful to the seen images. See many more examples in Appendix \ref{fig:app_40hour}.
  • Figure 4: Architecture of the Brain-Interaction-Transformer (BIT). (\ref{sec:BIT})
  • Figure 5: Reconstruction with a limited amount of subject-specific data (1 hour). We compare Brain-IT against 2 leading approaches that also provide 1-hour reconstructions (MindEye2 and MindTuner) for Subject 1. Brain-IT demonstrates greater fidelity to the seen image. See many more examples in Appendix \ref{fig:app_1hour}.
  • ...and 14 more figures