HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou

TL;DR

Reconstructing complex visual scenes from fMRI is hampered by low-level heterogeneity and high-level semantic entanglement. HAVIR tackles this by decomposing fMRI signals into structural and semantic voxels via a Structural Generator and a Semantic Extractor, whose outputs are fused in a pre-trained Versatile Diffusion model guided by CLIP embeddings. On the NSD dataset, HAVIR achieves superior structural fidelity and semantic alignment, outperforming state-of-the-art approaches across quantitative metrics and qualitative reconstructions. The approach supports individualized, cross-subject brain decoding with ROI-aware customization, advancing brain-computer vision applications.

Abstract

The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.
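The two-branch split described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the ROI mask, the dimensions, and the use of plain linear maps as stand-ins for the Structural Generator and the Semantic Extractor, whose actual architectures the abstract does not specify. The final fusion step in Versatile Diffusion is omitted.

```python
import numpy as np

def split_voxels(fmri, structural_mask):
    """Split one fMRI sample into two disjoint voxel groups using a boolean ROI mask."""
    structural_mask = np.asarray(structural_mask, dtype=bool)
    return fmri[structural_mask], fmri[~structural_mask]

def linear_map(x, weight, bias):
    """Hypothetical stand-in for a learned branch: a single affine projection."""
    return weight @ x + bias

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (not the paper's dimensions).
n_voxels, latent_dim, clip_dim = 12, 4, 6

fmri = rng.standard_normal(n_voxels)

# Assumption: the first 5 voxels play the role of "spatial processing" voxels,
# the rest the role of "semantic processing" voxels.
structural_mask = np.arange(n_voxels) < 5
structural_voxels, semantic_voxels = split_voxels(fmri, structural_mask)

# Branch 1 (stand-in for the Structural Generator): voxels -> latent diffusion prior.
W_s = rng.standard_normal((latent_dim, structural_voxels.size))
latent_prior = linear_map(structural_voxels, W_s, np.zeros(latent_dim))

# Branch 2 (stand-in for the Semantic Extractor): voxels -> CLIP-sized embedding.
W_c = rng.standard_normal((clip_dim, semantic_voxels.size))
clip_embedding = linear_map(semantic_voxels, W_c, np.zeros(clip_dim))

print(latent_prior.shape, clip_embedding.shape)
```

In this sketch each branch sees only its own voxel group, so the structural prior and the semantic embedding are computed from non-overlapping inputs before any fusion takes place.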

Paper Structure

This paper contains 18 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overall framework of HAVIR. Different data are used for training and testing.
  • Figure 2: Examples of the reconstruction results from HAVIR.
  • Figure 3: Qualitative comparisons on the NSD test dataset. The results of HAVIR demonstrate superior reconstruction accuracy compared to five recent SOTA methods.
  • Figure 4: Qualitative results of the full model and its ablated configurations.
  • Figure 5: Spatial mapping of brain region contributions to Structural Generator (A) and Semantic Extractor (B) on Subj01.
  • ...and 1 more figure