Decoding fMRI Data into Captions using Prefix Language Modeling

Vyacheslav Shen, Kassymzhomart Kunanbayev, Dae-Shik Kim

TL;DR

This work decodes fMRI signals into natural-language captions while avoiding the data contamination that can arise from COCO-trained captioning models. It uses a two-stage approach: first map the fMRI signal to a DINOv2 image embedding using a 3D CNN or Ridge regression, then feed the embedding's [CLS] token as a prefix to GPT-2 for caption generation, which reduces compute. The study reports that CNN-based mappings outperform linear methods and achieve competitive caption quality with orders of magnitude fewer parameters, while avoiding COCO contamination by relying on DINOv2. The findings point toward more efficient brain decoding pipelines and possible extensions to more complex tasks such as visual question answering.

Abstract

With advances in large language models and latent diffusion models, brain decoding has achieved remarkable results in recent years. Work on the NSD dataset, whose stimulus images come from the COCO dataset, leverages embeddings from the CLIP model for image reconstruction and GIT for captioning. However, the current captioning approach risks data contamination, given that the GIT model was trained on the COCO dataset. In this work, we present an alternative method for decoding brain signals into image captions: we predict a DINOv2 model's embedding of an image from the corresponding fMRI signal and then provide its [CLS] token as the prefix to the GPT-2 language model, which decreases computational requirements considerably. Additionally, instead of the commonly used Linear Regression, we explore a 3D Convolutional Neural Network mapping of fMRI signals to the image embedding space to better account for the positional information of voxels.
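The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it covers only the linear (Ridge) baseline for stage one, not the 3D CNN, and it uses small synthetic arrays in place of real fMRI data, DINOv2 embeddings, and GPT-2 token embeddings. The function names `fit_ridge` and `build_prefix_inputs`, the toy dimensions, and the learned `projection` matrix that maps the predicted [CLS] embedding into the language model's embedding space are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def build_prefix_inputs(cls_embedding, projection, caption_token_embeddings, prefix_len):
    """Prepend a projected image embedding to the caption token embeddings.

    `projection` is a hypothetical learned matrix; the source only states that
    the [CLS] token is provided as the prefix.
    """
    d_model = caption_token_embeddings.shape[1]
    prefix = (cls_embedding @ projection).reshape(prefix_len, d_model)
    return np.concatenate([prefix, caption_token_embeddings], axis=0)


# Toy sizes chosen for readability, not taken from the paper.
rng = np.random.default_rng(0)
n_trials, n_voxels, d_image, d_model = 200, 50, 16, 8
prefix_len, n_tokens = 1, 6

# Stage 1: map (synthetic) fMRI vectors to (synthetic) image embeddings.
true_W = rng.normal(size=(n_voxels, d_image))
fmri = rng.normal(size=(n_trials, n_voxels))
image_emb = fmri @ true_W  # noise-free targets, so ridge can recover true_W
W = fit_ridge(fmri, image_emb, lam=1e-6)
predicted_cls = fmri[0] @ W

# Stage 2: use the predicted embedding as a prefix for the language model.
projection = rng.normal(size=(d_image, prefix_len * d_model))
caption_emb = rng.normal(size=(n_tokens, d_model))
lm_inputs = build_prefix_inputs(predicted_cls, projection, caption_emb, prefix_len)

print(lm_inputs.shape)  # (7, 8): one prefix position followed by six caption tokens
```

In a real pipeline the concatenated sequence would be passed to the language model as input embeddings, with the training loss typically computed only on the caption positions; that detail is a common convention in prefix language modeling and is not confirmed by the text above.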

Paper Structure

This paper contains 9 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The scheme of our method. GPT-2 base model was used as the language model, while fMRI-DINOv2 embedding mapping