Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding (Survey)

Subba Reddy Oota, Zijiao Chen, Manish Gupta, Raju S. Bapi, Gael Jobard, Frederic Alexandre, Xavier Hinaut

TL;DR

This survey outlines how deep neural networks enable brain encoding and decoding across language, vision, and audition using naturalistic datasets. It details stimulus representations from text, images, sounds, and multimodal models, and reviews encoding/decoding pipelines, metrics, and regional brain mappings. Key contributions include a taxonomy of encoding/decoding methods, evaluation frameworks (including RSA/CKA/NDS and noise ceilings), and insights into how pretrained and fine-tuned models align with human brain activity, along with ethical considerations and future directions. The work highlights practical implications for brain-computer interfaces, neuro-AI research, and the development of cognitively plausible AI systems toward more interpretable, robust, and human-aligned models.

Abstract

Can artificial intelligence unlock the secrets of the human brain? How do the inner mechanisms of deep learning models relate to our neural circuits? Is it possible to enhance AI by tapping into the power of brain recordings? These captivating questions lie at the heart of an emerging field at the intersection of neuroscience and artificial intelligence. Our survey dives into this exciting domain, focusing on human brain recording studies and cutting-edge cognitive neuroscience datasets that capture brain activity during natural language processing, visual perception, and auditory experiences. We explore two fundamental approaches: encoding models, which attempt to generate brain activity patterns from sensory inputs; and decoding models, which aim to reconstruct our thoughts and perceptions from neural signals. These techniques not only promise breakthroughs in neurological diagnostics and brain-computer interfaces but also offer a window into the very nature of cognition. In this survey, we first discuss popular representations of language, vision, and speech stimuli, and present a summary of neuroscience datasets. We then review how the recent advances in deep learning transformed this field, by investigating the popular deep learning based encoding and decoding architectures, noting their benefits and limitations across different sensory modalities. From text to images, speech to videos, we investigate how these models capture the brain's response to our complex, multimodal world. While our primary focus is on human studies, we also highlight the crucial role of animal models in advancing our understanding of neural mechanisms. Throughout, we mention the ethical implications of these powerful technologies, addressing concerns about privacy and cognitive liberty. We conclude with a summary and discussion of future trends in this rapidly evolving field.
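
The encoding direction described in the abstract (predicting brain activity from stimulus features) can be illustrated with a minimal sketch. It assumes a linear ridge-regression mapping and per-voxel Pearson correlation as the score; both are illustrative choices rather than the survey's prescribed method. The data are synthetic stand-ins, and the helper names `fit_ridge` and `pearson_per_voxel` are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def pearson_per_voxel(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, one value per column."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 16, 5

# Synthetic stand-ins (assumption): X plays the role of stimulus features
# from a DNN, Y the role of recorded brain responses.
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + 0.5 * rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ W_true + 0.5 * rng.normal(size=(n_test, n_voxels))

# Encoding: fit features -> responses on training data, score on held-out data.
W_hat = fit_ridge(X_train, Y_train, alpha=1.0)
scores = pearson_per_voxel(Y_test, X_test @ W_hat)
print(scores.shape)
```

Under the same assumptions, a linear decoding model swaps the roles of the two matrices, for example `fit_ridge(Y_train, X_train)`, so that stimulus features are predicted from brain responses.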

Paper Structure

This paper contains 52 sections, 7 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Overview of encoding and decoding pipelines across neuroimaging modalities (fMRI, MEG, and EEG), stimulus modalities (language, audio, visual, and multimodal), and tasks (reading, listening, and watching static images or videos, with or without audio). The pipelines also incorporate stimulus representations obtained from different types of DNNs, a mapping between DNN and brain representations via linear or non-linear models, and evaluation measures that estimate the performance of encoding/decoding models. Visualization tools facilitate intuitive presentation of the results.
  • Figure 2: Overview of different brain–machine interfacing methods and their spatial and temporal resolution. Methods included: electroencephalography (EEG), magnetoencephalography (MEG), near-infrared spectroscopy (NIRS), functional magnetic resonance imaging (fMRI), electrocorticography (ECoG), microelectrode array (MEA) recordings and single microelectrode (ME) recordings. Figure is adapted from van2009brain, and used with permission from the respective authors.
  • Figure 3: Representative Samples of Naturalistic Brain Datasets. (Left) Comparison of brain activity patterns recorded during reading and listening to the same narrative, illustrating modality-specific and shared neural responses deniz2019representation. (Right) Examples of diverse naturalistic stimuli used in various public neuroimaging repositories: complex visual scenes from BOLD5000 chang2019bold5000, video frames from ShortClips huth2022gallant, natural images from the Natural Scenes Dataset (NSD) allen2022massive, and multimodal stimuli (text and images) from the Pereira dataset pereira2018toward. Adapted with permission from the respective authors.
  • Figure 4: (a) Context Representation of Words in Language Models. This figure illustrates how past and future context is constructed for different word orders. Using the word "vehicle" as an example, we demonstrate how preceding words (past context) and succeeding words (future context) are considered for various context lengths. (b) Extraction of Image Representations for Brain-Computer Interface Models. This figure illustrates the process of extracting layer-wise image representations from Convolutional Neural Network (CNN) models. These representations have been extensively studied in prior research (yamins2014performance, horikawa2017generic) for their effectiveness in both brain encoding (predicting neural responses from visual stimuli) and decoding (reconstructing visual stimuli from brain activity) models. The figure demonstrates how different layers of a CNN, from early layers capturing low-level features to deeper layers representing more abstract concepts, can be utilized to understand and predict brain responses to visual stimuli. Figure is adapted from horikawa2017generic, and used with permission from the respective authors.
  • Figure 5: Extraction of contextualized speech representations. The representation of the last frame within each window is used, which captures temporal dynamics and contextual nuances in the speech signal. The window length is typically varied from 16 to 64 seconds, with strides ranging from 10 to 100 milliseconds.
  • ...and 17 more figures
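
The windowing described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming fixed-length windows that must fit entirely inside the signal and integer millisecond timestamps. The function names are hypothetical, and the handling of the signal's start and end in the original studies is not specified here.

```python
def sliding_windows(total_ms, window_ms, stride_ms):
    """Return (start_ms, end_ms) spans of fixed-length windows over a signal.

    Assumption: only windows that fit entirely inside the signal are kept.
    """
    if window_ms <= 0 or stride_ms <= 0:
        raise ValueError("window_ms and stride_ms must be positive")
    spans = []
    start = 0
    while start + window_ms <= total_ms:
        spans.append((start, start + window_ms))
        start += stride_ms
    return spans


def window_representation(frame_vectors):
    """Pick the last frame's vector as the representation of the whole window."""
    if not frame_vectors:
        raise ValueError("window contains no frames")
    return frame_vectors[-1]


# A 20-second signal, 16-second windows, 100-millisecond stride.
spans = sliding_windows(total_ms=20_000, window_ms=16_000, stride_ms=100)
print(len(spans), spans[0], spans[-1])

# Toy per-frame vectors standing in for a speech model's output on one window.
toy_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(window_representation(toy_frames))
```

Because each window's representation comes from its final frame, that vector can summarize everything the speech model has seen up to the window's end time, while the short stride keeps the sequence of representations finely sampled.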