UMBRAE: Unified Multimodal Brain Decoding

Weihao Xia; Raoul de Charette; Cengiz Öztireli; Jing-Hao Xue

TL;DR

UMBRAE tackles the challenge of decoding brain signals across subjects by introducing a universal brain encoder that aligns neural activity with pretrained image features, enabling multimodal decoding through frozen multimodal language models. A cross-subject training strategy maps diverse subjects into a common space, supporting data-efficient adaptation to new subjects. The paper also presents BrainHub, a comprehensive NSD-extended benchmark that pairs fMRI with semantic and spatial annotations to evaluate brain captioning, grounding, retrieval, and visual decoding. Across tasks, UMBRAE achieves superior or competitive performance with improved efficiency, demonstrating robust cross-subject generalization and practical subject adaptation, with code and BrainHub made publicly available.

Abstract

We address prevailing challenges in brain-powered research, starting from the observation that existing methods rarely recover accurate spatial information and require subject-specific models. To address these challenges, we propose UMBRAE, a unified multimodal decoding of brain signals. First, to extract instance-level conceptual and spatial details from neural signals, we introduce an efficient universal brain encoder for multimodal-brain alignment and recover object descriptions at multiple levels of granularity from a subsequent multimodal large language model (MLLM). Second, we introduce a cross-subject training strategy that maps subject-specific features to a common feature space. This allows a model to be trained on multiple subjects without extra resources, even yielding superior results compared to subject-specific models. Further, we demonstrate that this supports weakly-supervised adaptation to new subjects, with only a fraction of the total training data. Experiments demonstrate that UMBRAE not only achieves superior results in the newly introduced tasks but also outperforms existing methods in well-established tasks. To assess our method, we construct and share with the community a comprehensive brain understanding benchmark, BrainHub. Our code and benchmark are available at https://weihaox.github.io/UMBRAE.
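
The cross-subject idea in the abstract can be illustrated with a toy sketch: each subject gets its own tokenizer that maps that subject's voxels into a common space, and a single shared encoder maps that space to image-feature space. This is not the authors' implementation. The class name `ToyBrainEncoder`, the purely linear layers, the toy dimensions, and the mean-squared-error alignment loss are all assumptions made for illustration, and the weights are random rather than trained.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random (n_out x n_in) weight matrix; a stand-in for a learned layer."""
    scale = 1.0 / n_in ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product written with plain lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyBrainEncoder:
    """Hypothetical sketch: per-subject tokenizers feeding one shared encoder."""

    def __init__(self, voxels_per_subject, hidden_dim, feature_dim, seed=0):
        rng = random.Random(seed)
        # One tokenizer per subject, because voxel counts differ across subjects.
        self.tokenizers = {
            subject: make_linear(n_voxels, hidden_dim, rng)
            for subject, n_voxels in voxels_per_subject.items()
        }
        # A single encoder shared by all subjects (the "universal" part).
        self.shared = make_linear(hidden_dim, feature_dim, rng)

    def encode(self, subject, fmri):
        common = apply_linear(self.tokenizers[subject], fmri)  # common space
        return apply_linear(self.shared, common)               # feature space


def mse(a, b):
    """Assumed alignment loss between brain features and image features."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy voxel counts, not the real NSD sizes.
voxels = {"S1": 12, "S2": 9}
encoder = ToyBrainEncoder(voxels, hidden_dim=6, feature_dim=4)

rng = random.Random(1)
fmri_s1 = [rng.gauss(0, 1) for _ in range(voxels["S1"])]
fmri_s2 = [rng.gauss(0, 1) for _ in range(voxels["S2"])]

feat_s1 = encoder.encode("S1", fmri_s1)
feat_s2 = encoder.encode("S2", fmri_s2)

# Placeholder for a pretrained image feature of the viewed stimulus.
image_feature = [0.0] * 4
loss = mse(feat_s1, image_feature)
```

Inputs of different lengths from S1 and S2 end up as vectors of the same size, which is what lets one model train on several subjects at once. Under this sketch, adapting to a new subject would mean adding one more entry to `tokenizers` while reusing `shared`.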

Paper Structure (38 sections, 2 equations, 19 figures, 13 tables)

This paper contains 38 sections, 2 equations, 19 figures, 13 tables.

Introduction
Related Works
UMBRAE
Architecture
Cross-Subject Alignment
Multimodal Alignment
Brain Prompting Interface
Experiments
Implementation Details
BrainHub
Brain Captioning
Brain Grounding
Brain Retrieval
Visual Decoding
Weakly-Supervised Adaptation
...and 23 more sections

Figures (19)

Figure 1: Multimodal Decoding. By aligning brain features with MLLMs, UMBRAE decodes multimodal cues from brain signals, which allows multiple downstream tasks.
Figure 1: BrainHub Statistics. We illustrate the statistics and the mapping relationships for the category ontology used in BrainHub, w.r.t. the original COCO classes. Please zoom in on the statistics panel for details.
Figure 2: Overview of UMBRAE. Our brain encoder includes subject-specific tokenizers and a universal perceive encoder (see Architecture). Neural signals (fMRI) from multiple subjects are mapped into a common feature space, enabling cross-subject training and weakly-supervised adaptation (see Cross-Subject Alignment). The brain encoder learns to align neural signals with image features (see Multimodal Alignment). During inference, the learned encoder interacts with MLLMs and performs brain understanding tasks according to given prompts (see Brain Prompting Interface).
Figure 2: Brain Captioning Comparison on S1. Baselines for S1 include SDRecon [takagi2023improving], BrainCap [ferrante2023brain], and OneLLM [han2024onellm]. 'UMBRAE-S1' refers to our model trained only with subject S1, while 'UMBRAE' denotes the model with cross-subject training.
Figure 3: Example Results. Our method inherits the multimodal capability from MLLMs and thus supports multiple brain captioning and grounding tasks. Different task prompts for the same input brain signal result in unique outcomes.
...and 14 more figures
