Brain-Conditional Multimodal Synthesis: A Survey and Taxonomy

Weijian Mai; Jian Zhang; Pengfei Fang; Zhijun Zhang

TL;DR

This survey defines AIGC-Brain as the paradigm of decoding non-invasive brain signals into perceptual content to guide multimodal generation, focusing on passive tasks such as IBI, VBV, SBS, MBM, IBT, VBT, and SBT. It presents a six-type methodology taxonomy (Map, BPM, BPFA, MTF, End-to-End, CAEA) and reviews task-specific pipelines, datasets, and representative models, emphasizing how brain priors interface with pretrained AIGC decoders, particularly diffusion-based generators. The work provides qualitative and quantitative benchmarks across datasets (fMRI, EEG, MEG) and modalities, highlighting progress in high-fidelity, semantically coherent reconstructions while identifying challenges in data scarcity, interpretability, and real-time deployment. It concludes with prospects for unified Brain-to-Any multimodal synthesis and emphasizes the need for improved cross-modal alignment, dataset scale, and multimodal decoding capabilities to advance BCI-enabled content generation and understanding of neural perception.

Abstract

In the era of Artificial Intelligence Generated Content (AIGC), conditional multimodal synthesis technologies (e.g., text-to-image, text-to-video, and text-to-audio) are gradually reshaping the natural content in the real world. The key to multimodal synthesis technology is to establish the mapping relationship between different modalities. Brain signals, serving as potential reflections of how the brain interprets external information, exhibit a distinctive One-to-Many correspondence with various external modalities. This correspondence makes brain signals a promising guiding condition for multimodal content synthesis. Brain-conditional multimodal synthesis refers to decoding brain signals back to perceptual experience, which is crucial for developing practical brain-computer interface systems and unraveling the complex mechanisms underlying how the brain perceives and comprehends external stimuli. This survey comprehensively examines the emerging field of AIGC-based Brain-conditional Multimodal Synthesis, termed AIGC-Brain, to delineate the current landscape and future directions. To begin, related brain neuroimaging datasets, functional brain regions, and mainstream generative models are introduced as the foundation of AIGC-Brain decoding and analysis. Next, we provide a comprehensive taxonomy for AIGC-Brain decoding models and present task-specific representative work and detailed implementation strategies to facilitate comparison and in-depth analysis. Quality assessments are then introduced for both qualitative and quantitative evaluation. Finally, this survey discusses the insights gained, identifies current challenges, and outlines prospects for AIGC-Brain. As the inaugural survey in this domain, this paper paves the way for the progress of AIGC-Brain research, offering a foundational overview to guide future work.

Paper Structure (47 sections, 4 equations, 9 figures, 8 tables)

Introduction
Neuroimaging Datasets
Image Datasets
Video Datasets
Video&Speech Datasets
Sound&Speech Datasets
Music Datasets
Brain Regions
Visual Cortex
Early Visual Cortex
Higher Visual Cortex
Auditory Cortex
Language Cortex
Generative Models
Diffusion Models
...and 32 more sections

Figures (9)

Figure 1: Brain-Conditional Multimodal Synthesis via AIGC-Brain Decoder. Sensory stimuli comprising visual stimuli (Image (I), Video (V)) and audio stimuli (Music (M), Speech/Sound (S)) from the external world are first encoded to non-invasive brain signals (EEG, fMRI, or MEG) and then decoded back to perceptual experience via the AIGC-Brain decoder. This survey focuses on passive brain-conditional multimodal synthesis tasks including Image-Brain-Image (IBI), Video-Brain-Video (VBV), Sound-Brain-Sound (SBS), Music-Brain-Music (MBM), Image-Brain-Text (IBT), Video-Brain-Text (VBT), and Speech-Brain-Text (SBT), where IBI refers to image synthesis tasks conditioned on brain signals evoked by image stimuli.
Figure 2: EEG 10-10 channel system and brain regions. Pink: Occipital lobe; Green: Temporal lobe; Yellow: Parietal lobe; Blue: Frontal lobe.
Figure 3: Different types of methods for AIGC-Brain tasks. Map: Mapping; MTF: Map&Train&Finetune; E2E: End-to-End; BPM: Brain-Pretrain&Map; BPFA: Brain-Pretrain&Finetune&Align; AEA: Auto-Encoder&Align. Brain pretraining on large-scale neuroimaging datasets (e.g., HCP and MOABB) is the first stage of BPFA and BPM methods.
Figure 4: Brain-Conditional I2I-LDMs: State-of-the-art framework for IBI tasks.
Figure 5: Mind-Video: State-of-the-art model for VBV tasks.
...and 4 more figures
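
The task acronyms in the Figure 1 caption follow a stimulus-Brain-output pattern, where the first letter names the stimulus modality that evoked the brain signal and the last letter names the modality being synthesized. The short Python sketch below spells that convention out. The letter meanings and the seven task names come from the caption; the names `MODALITIES`, `PASSIVE_TASKS`, and `expand_task` are hypothetical and introduced here only for illustration.

```python
# Letter meanings as given in the Figure 1 caption.
# "S" covers both Speech and Sound (Sound-Brain-Sound, Speech-Brain-Text).
MODALITIES = {
    "I": "Image",
    "V": "Video",
    "M": "Music",
    "S": "Speech/Sound",
    "T": "Text",
}

# The seven passive tasks listed in the Figure 1 caption.
PASSIVE_TASKS = ("IBI", "VBV", "SBS", "MBM", "IBT", "VBT", "SBT")


def expand_task(acronym):
    """Split a task acronym into (stimulus, condition, output) names.

    Hypothetical helper for illustration; it is not part of the survey.
    """
    if len(acronym) != 3 or acronym[1] != "B":
        raise ValueError(f"expected <stimulus>B<output>, got {acronym!r}")
    stimulus, output = acronym[0], acronym[2]
    if stimulus not in MODALITIES or output not in MODALITIES:
        raise ValueError(f"unknown modality letter in {acronym!r}")
    return (MODALITIES[stimulus], "Brain", MODALITIES[output])


if __name__ == "__main__":
    for task in PASSIVE_TASKS:
        print(task, "=", "-".join(expand_task(task)))
```

For example, `expand_task("IBT")` returns `("Image", "Brain", "Text")`, that is, text synthesis conditioned on brain signals evoked by image stimuli.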
