
Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment

Zehui Feng, Chenqi Zhang, Mingru Wang, Minuo Wei, Shiwei Cheng, Cuntai Guan, Ting Han

TL;DR

Bratrix tackles the challenge of mapping neural signals from EEG, MEG, and fMRI to rich visual semantics by anchoring cross-modal alignment in language. It introduces Vision Semantic Decoupling and Language Semantic Decoupling to disentangle visual information and ground neural representations in language-derived semantics, complemented by an uncertainty perception module and a language-aligned cross-modal loss. A two-stage training regime (unimodal pretraining followed by multimodal fine-tuning) yields Bratrix and Bratrix-M, which achieve state-of-the-art retrieval, reconstruction, and captioning on THINGS-based benchmarks, including a notable 14.3% improvement in 200-way EEG retrieval. The approach promises more interpretable, robust brain–computer alignment with potential applications in neuroimaging, neuroscience research, and assistive technologies, while suggesting directions for overcoming inter-subject variability and extending to dynamic, multi-modal cognition.

Abstract

Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI remains a fundamental challenge due to subject variability and the entangled nature of visual features. Existing approaches primarily align neural activity directly with visual embeddings, but visual-only representations often fail to capture latent semantic dimensions, limiting interpretability and robustness. To address these limitations, we propose Bratrix, the first end-to-end framework to achieve multimodal Language-Anchored Vision-Brain alignment. Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, and projects both visual and brain representations into a shared latent space, enabling the formation of aligned visual-language and brain-language embeddings. To emulate human-like perceptual reliability and handle noisy neural signals, Bratrix incorporates a novel uncertainty perception module that applies uncertainty-aware weighting during alignment. By leveraging learnable language-anchored semantic matrices to enhance cross-modal correlations and employing a two-stage training strategy of single-modality pretraining followed by multimodal fine-tuning, Bratrix-M improves alignment precision. Extensive experiments on EEG, MEG, and fMRI benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and captioning performance compared to state-of-the-art methods, including a 14.3% improvement on the 200-way EEG retrieval task. Code and model are available.
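
The uncertainty-aware weighting described in the abstract can be illustrated with a minimal sketch. This is not the Bratrix loss: it assumes a symmetric InfoNCE-style contrastive objective between brain embeddings and language-anchored embeddings, and it swaps in a common heteroscedastic weighting (each sample's loss scaled by exp(-log_var), plus a log_var penalty) as a stand-in for the paper's uncertainty perception module. The names `uncertainty_weighted_contrastive_loss`, `log_var`, and `temperature` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def uncertainty_weighted_contrastive_loss(brain, anchor, log_var, temperature=0.07):
    """Hypothetical sketch: brain and anchor are (N, D) arrays whose i-th rows
    are matched pairs; log_var is an (N,) array of predicted log-variances."""
    logits = l2_normalize(brain) @ l2_normalize(anchor).T / temperature
    idx = np.arange(len(brain))
    # Symmetric cross-entropy: brain-to-anchor (rows) and anchor-to-brain (columns).
    per_sample = -0.5 * (
        log_softmax(logits, axis=1)[idx, idx] + log_softmax(logits, axis=0)[idx, idx]
    )
    # Assumed weighting: uncertain samples contribute less to the alignment term,
    # while the additive log_var term discourages predicting high uncertainty everywhere.
    return float(np.mean(np.exp(-log_var) * per_sample + log_var))
```

With `log_var` set to zeros, the function reduces to a plain symmetric contrastive loss; raising `log_var` for a noisy trial shrinks that trial's contribution to the alignment term.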

Paper Structure

This paper contains 33 sections, 10 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Overview of Language-Anchored Vision-Brain Alignment.
  • Figure 2: Overall framework of Bratrix. Bratrix comprises a vision semantic decoupling module, a brain encoder module, a language semantic decoupling module, and a language-anchored visual-brain alignment module. The framework has four phases: single-modal pre-training, multi-modal fine-tuning, inference, and downstream tasks.
  • Figure 3: (a) Visualization of zero-shot Top-10 EEG retrieval performance. (b) Example visualization of multimodal semantics, uncertainty, and the semantic matrix. (c) Zero-shot Top-5 performance of compared methods. (d) Zero-shot Top-5 performance in the ablation experiments. (e) Representational similarity matrices (RSMs) of brain neural signals across categories (Tool, Food, Clothes, Vehicle, Animal, and Others), with a zoomed-in view of the Food category.
  • Figure 4: t-SNE visualization (van der Maaten and Hinton, 2008) of Bratrix and Bratrix-M by semantic category and subject.
  • Figure 5: (a) Zero-shot comparison across 10 subjects and (b) comparison between coarse and aligned EEG.
  • ...and 15 more figures
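
The zero-shot Top-5 and Top-10 retrieval figures mentioned in the captions above are N-way top-k accuracies. A minimal sketch of that metric follows, assuming cosine similarity between each neural embedding and N candidate embeddings, with the i-th candidate as the correct match for the i-th query. The function name `top_k_retrieval_accuracy` and the array layout are hypothetical, not taken from the paper's code.

```python
import numpy as np


def top_k_retrieval_accuracy(queries, candidates, k=5):
    """Fraction of queries whose matching candidate (same row index)
    ranks among the k most cosine-similar candidates."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T                              # (N, N) cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best candidates per query
    targets = np.arange(len(q))[:, None]        # correct candidate index for each query
    return float((top_k == targets).any(axis=1).mean())
```

For a 200-way task, `queries` and `candidates` would each hold 200 rows, and chance level for top-k accuracy is k/200.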