
Learning Brain Representation with Hierarchical Visual Embeddings

Jiawen Zheng, Haonan Jia, Ming Li, Yuhui Zheng, Yufeng Zeng, Yang Gao, Chen Liang

TL;DR

The paper tackles brain-to-visual decoding by aligning brain signals with hierarchical visual representations built from multiple pretrained encoders that span semantic and pixel-level information. It introduces a fusion-based brain–vision interface comprising Hierarchical Visual Fusion (HVF) and a pretrained Fusion Prior, which enables contrastive alignment of brain embeddings to a stable, multimodal visual space and subsequent diffusion-based reconstruction with text-free conditioning. Experiments on THINGS-EEG and THINGS-MEG report state-of-the-art 200-way zero-shot retrieval and improved reconstruction quality, and ablations show that CLIP-style semantics and VAE-based pixel features are complementary. The approach is plug-and-play across EEG backbones and offers insight into the multiscale visual content encoded in brain signals, covering both the high-level and low-level information captured by neural activity. Its practical impact lies in more robust brain-to-visual decoding, with potential implications for neuroscience, AI-assisted brain-computer interfaces, and the understanding of visual representation in the human brain.

Abstract

Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.

Paper Structure (37 sections, 7 equations, 9 figures, 16 tables)


Figures (9)

  • Figure 1: Learning pipelines. Left: Retrieval objective that aligns the brain embedding $z_b$ with the fused visual embedding $z_f$ (HVF over $K$ pretrained encoders) using a symmetric InfoNCE loss; evaluation is nearest-neighbor retrieval in the fused space. Right: Reconstruction pipeline with a frozen, pretrained fusion prior, which consists of HVF plus a Conditioning Adapter (MLP projector + IP-Adapter with decoupled cross-attention). We contrastively align $z_b$ to the frozen $z_f$, project it to $z_c$, and inject $z_c$ into a frozen SDXL UNet to synthesize the image. Visual encoders and the UNet are frozen; only the brain side is updated during alignment.
  • Figure 2: Hard-case retrieval comparison. The top-5 retrieved images on the hard-case set from our method and the UBP baseline.
  • Figure 3: Qualitative comparison of brain-to-image reconstructions. Each triplet shows the ground-truth stimulus (left), baseline (middle), and our reconstruction (right). All examples use EEG recordings from subject 8.
  • Figure 4: Ablation on fusion priors. Each row shows the ground-truth stimulus and reconstructions produced with different fused configurations: H14, H14+B32, H14+VAE, and H14+B32+VAE. All examples use EEG recordings from subject 8.
  • Figure 5: UMAP visualization of learned embeddings on the test split of the THINGS-EEG dataset. Left: multi-encoder visual embeddings (RN50, flattened VAE) and the fused token are projected together with Subject 8 EEG embeddings. Right: EEG embeddings from all 10 subjects.
  • ...and 4 more figures
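
The retrieval objective described in the Figure 1 caption (symmetric InfoNCE between $z_b$ and $z_f$, followed by nearest-neighbor retrieval) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature value of 0.07, and the use of cosine similarity are assumptions made for illustration, and the embeddings are random stand-ins for the outputs of a brain encoder and of HVF.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_b, z_f, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    z_b: (N, D) brain embeddings; z_f: (N, D) fused visual embeddings.
    Row i of z_b is assumed to be paired with row i of z_f.
    The temperature is a hypothetical default, not a value from the paper.
    """
    logits = l2_normalize(z_b) @ l2_normalize(z_f).T / temperature
    idx = np.arange(logits.shape[0])
    # Brain-to-image direction: each row is a classification over images.
    loss_b2f = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Image-to-brain direction: each column is a classification over brain samples.
    loss_f2b = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_b2f + loss_f2b)


def top_k_retrieval(z_b, z_f, k=5):
    """Return, for each brain embedding, the indices of its k most similar candidates."""
    sims = l2_normalize(z_b) @ l2_normalize(z_f).T
    return np.argsort(-sims, axis=1)[:, :k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_f = rng.normal(size=(200, 64))               # stand-in for 200 candidate images
    z_b = z_f + 0.5 * rng.normal(size=z_f.shape)   # noisy stand-in for brain embeddings
    print("loss:", symmetric_info_nce(z_b, z_f))
    print("top-5 for first query:", top_k_retrieval(z_b, z_f, k=5)[0])
```

With 200 candidate rows, `top_k_retrieval` mirrors the shape of a 200-way retrieval evaluation: a query counts as a top-k hit when its own index appears among the returned columns.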