HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou
TL;DR
Reconstructing complex visual scenes from fMRI is hampered by heterogeneous low-level features and semantically entangled high-level features. HAVIR tackles this by dividing visual-cortex voxels into spatial-processing and semantic-processing groups: a Structural Generator converts the former into latent diffusion priors, a Semantic Extractor converts the latter into CLIP embeddings, and a pre-trained Versatile Diffusion model fuses both to synthesize the image. On the Natural Scenes Dataset (NSD), HAVIR achieves superior structural fidelity and semantic alignment, outperforming state-of-the-art approaches across quantitative metrics and qualitative reconstructions. The approach supports individualized, cross-subject brain decoding with ROI-aware customization, advancing brain-computer vision applications.
Abstract
The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.
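The two-branch decomposition described above can be illustrated with a toy sketch. This is not the authors' implementation: the closed-form ridge regressors stand in for the Structural Generator and Semantic Extractor, the data are random, and the names `fit_ridge`, `split_voxels`, and `structural_mask`, along with all array sizes, are hypothetical. The final fusion step in Versatile Diffusion is omitted.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def split_voxels(fmri, structural_mask):
    """Split a (trials, voxels) array into structural and semantic voxel groups."""
    return fmri[:, structural_mask], fmri[:, ~structural_mask]

rng = np.random.default_rng(0)
n_trials, n_voxels = 64, 200      # hypothetical sizes
latent_dim, clip_dim = 16, 32     # hypothetical, far smaller than real models

# Random stand-ins for fMRI responses and for the two regression targets.
fmri = rng.standard_normal((n_trials, n_voxels))
latent_target = rng.standard_normal((n_trials, latent_dim))  # image latents
clip_target = rng.standard_normal((n_trials, clip_dim))      # CLIP embeddings

# Hypothetical ROI mask: the first 80 voxels are treated as spatial-processing.
structural_mask = np.zeros(n_voxels, dtype=bool)
structural_mask[:80] = True

x_struct, x_sem = split_voxels(fmri, structural_mask)

# Each branch learns its own mapping from its own voxel group.
W_struct = fit_ridge(x_struct, latent_target)  # structural branch -> latent prior
W_sem = fit_ridge(x_sem, clip_target)          # semantic branch -> CLIP space

latent_prior = x_struct @ W_struct
clip_pred = x_sem @ W_sem
```

In a full pipeline, `latent_prior` would initialize the diffusion process and `clip_pred` would condition it. The point of the sketch is only that each branch sees a disjoint voxel group and predicts a different target space.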
