Learning Brain Representation with Hierarchical Visual Embeddings
Jiawen Zheng, Haonan Jia, Ming Li, Yuhui Zheng, Yufeng Zeng, Yang Gao, Chen Liang
TL;DR
The paper tackles brain-to-visual decoding by aligning brain signals with hierarchical visual representations built from multiple pretrained encoders that span semantic and pixel-level information. It introduces a fusion-based brain–vision interface comprising a Hierarchical Visual Fusion (HVF) module and a pretrained Fusion Prior, which together enable contrastive alignment of brain embeddings to a stable, multimodal visual space and subsequent diffusion-based reconstruction with text-free conditioning. Experiments on THINGS-EEG and THINGS-MEG report state-of-the-art 200-way zero-shot retrieval and improved reconstruction quality, and ablations show that CLIP-style semantic features and VAE-based pixel features are complementary. The approach is plug-and-play across EEG backbones and offers insight into the multiscale visual content encoded in brain signals, covering both the high-level and low-level information captured by neural activity. Its practical impact lies in more robust brain-to-visual decoding, with potential implications for neuroscience, AI-assisted brain-computer interfaces, and the understanding of visual representation in the human brain.
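To make the 200-way zero-shot retrieval metric mentioned above concrete, the following is a minimal sketch of how top-k retrieval accuracy is commonly computed from paired embeddings. It is not the paper's evaluation code: the function names, the cosine-similarity scoring, the embedding width of 64, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def topk_retrieval_accuracy(brain_emb, image_emb, k=1):
    """Fraction of brain queries whose paired image is among the k nearest candidates.

    Assumes brain_emb[i] and image_emb[i] come from the same stimulus
    (hypothetical pairing convention for this sketch).
    """
    sims = l2_normalize(brain_emb) @ l2_normalize(image_emb).T  # (N, N) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]                     # k best candidates per query
    targets = np.arange(sims.shape[0])[:, None]                 # correct index for each query
    return float(np.mean(np.any(topk == targets, axis=1)))


# Synthetic stand-in for a 200-way test set: 200 candidate images, one query each.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 64))
brain_emb = image_emb + 0.5 * rng.normal(size=(200, 64))  # noisy copies, not real brain data

print("top-1:", topk_retrieval_accuracy(brain_emb, image_emb, k=1))
print("top-5:", topk_retrieval_accuracy(brain_emb, image_emb, k=5))
```

With 200 candidates, chance level is 0.5% for top-1 and 2.5% for top-5, which is the baseline against which such retrieval scores are usually read.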
Abstract
Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.
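The alignment idea described in the abstract can be illustrated with a small sketch: features from two visual encoders are fused into one target embedding, and a symmetric contrastive (InfoNCE-style) loss pulls each brain embedding toward its paired target. This is a sketch under stated assumptions, not the authors' implementation. The fusion here is plain normalize-and-concatenate standing in for the HVF module, the Fusion Prior is not modeled, and the names `fuse_visual_features` and `symmetric_info_nce`, the temperature of 0.07, and all array sizes are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def fuse_visual_features(semantic_feat, pixel_feat):
    """Hypothetical fusion: normalize each encoder's output, then concatenate.

    A stand-in for a learned fusion module; the paper's HVF design is not
    specified in this section.
    """
    return np.concatenate([l2_normalize(semantic_feat), l2_normalize(pixel_feat)], axis=-1)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(brain_emb, visual_emb, temperature=0.07):
    """Average of brain-to-image and image-to-brain cross-entropy over a batch.

    Row i of each input is assumed to be a matched pair; every other row in
    the batch serves as a negative. The temperature value is an assumption.
    """
    logits = l2_normalize(brain_emb) @ l2_normalize(visual_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_b2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # each brain row picks its image
    loss_v2b = -log_softmax(logits, axis=0)[idx, idx].mean()  # each image column picks its brain row
    return 0.5 * (loss_b2v + loss_v2b)


# Synthetic batch: 8 stimuli, a 16-d "semantic" and a 4-d "pixel-level" feature each.
rng = np.random.default_rng(0)
semantic = rng.normal(size=(8, 16))
pixel = rng.normal(size=(8, 4))
target = fuse_visual_features(semantic, pixel)            # shape (8, 20)

aligned_brain = target + 0.05 * rng.normal(size=target.shape)
random_brain = rng.normal(size=target.shape)

print("loss (aligned):", symmetric_info_nce(aligned_brain, target))
print("loss (random): ", symmetric_info_nce(random_brain, target))
```

The aligned batch should yield a much lower loss than the random one, which is the signal a brain encoder would be trained to minimize. In practice the brain embeddings would come from a trainable EEG or MEG encoder and the loss would be computed in an autodiff framework; NumPy is used here only to keep the sketch self-contained.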
