LaVCa: LLM-assisted Visual Cortex Captioning

Takuya Matsuyama, Shinji Nishimoto, Yu Takagi

TL;DR

LaVCa tackles the interpretability gap of voxel-level brain encoding by using large language models to generate open-ended, caption-based descriptions of voxel selectivity in the visual cortex. The approach separates optimal-image identification from caption generation, leveraging CLIP-based encodings and multimodal LLMs to produce richer, sentence-level captions through keyword extraction and a sentence composer. Empirical results on the NSD dataset show LaVCa outperforms BrainSCUBA in sentence- and image-level brain activity predictions and yields substantially greater lexical and semantic diversity, revealing fine-grained inter-voxel and intra-voxel representations within ROIs like OFA and PPA. The work demonstrates the potential of LLM-driven textual representations to enhance the interpretability of brain representations and suggests avenues for extending to multimodal and higher-order cognitive processes.

Abstract

Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa to image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. The captions generated by LaVCa also quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. Moreover, a more detailed analysis of the voxel-specific properties generated by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex, as well as voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations. Please check out our webpage at https://sites.google.com/view/lavca-llm/

Paper Structure

This paper contains 28 sections, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Illustration of our paper. Our proposed method, LaVCa, generates text captions that explain voxel selectivity and surpass existing approaches, such as one-hot vectors and BrainSCUBA, enabling a more detailed description of the properties of visual cortex voxels.
  • Figure 2: The relationship between sentence-level prediction performance and the number of words in voxel captions for a single subject (subj01). The number following "LaVCa" indicates the number of optimal images used for summarization, while the number following "Concat" indicates the number of concatenated captions from the optimal images. Error bars indicate the standard error. LaVCa explains the properties of voxels well using a small number of words.
  • Figure 3: Architecture of LaVCa. (a) We construct a voxel-wise encoding model for a human subject’s brain activity data (measured using fMRI) while viewing images, using CLIP-Vision latent representations. The encoding weight is obtained through ridge regression. (b) We identify the optimal images for a given voxel by calculating the inner product between the CLIP-Vision latent representations of external image datasets and the voxel’s trained encoding weight, selecting the top-N images (the "optimal image set") that produce the highest predicted activation. (c) Next, we use a Multimodal LLM (MLLM) to generate captions for each optimal image set, allowing an LLM to interpret them. (d) Finally, we prompt an LLM to extract keywords from the captions, filter these keywords, and feed them into a "Sentence Composer," producing a concise voxel caption.
  • Figure 4: Mapping of brain activity prediction accuracy (subj01). (a) The sentence-level prediction performance is projected onto inflated cortical surfaces (top: lateral, medial, and dorsal views) and flattened cortical surfaces (bottom, with the occipital areas at the center) for both hemispheres. Voxels with significant prediction performance are color-coded (all colored voxels $P<0.05$, FDR corrected). The white outlines indicate the ROIs that are among the top two in terms of the total voxel count across subjects for each semantic category: Body (Extrastriate Body Area; EBA, and Fusiform Body Area; FBA-2), Face (Fusiform Face Area; FFA-1, and Occipital Face Area; OFA), and Places (Parahippocampal Place Area; PPA, and Occipital Place Area; OPA). Word areas are shown in Figure \ref{appendix:sentence_cc_flatmap}. (b) A comparison of sentence-level prediction performance between our method, LaVCa, and the existing method, BrainSCUBA, on the flattened cortical surface. If only one model exhibits significant prediction performance for a given voxel, the other model's performance for that voxel is set to zero and color-coded accordingly.
  • Figure 5: Interpretation of LaVCa captions in OFA. (a) The UMAP projection of caption text for all participants, visualized on a flatmap (top). A word cloud of the 100 most frequent words in these captions (middle), colored according to their location in the UMAP space. A bar graph of the top 10 most frequent words (bottom). (b) Visualization of the top two captions (by accuracy) for eight clusters on the flatmap (subj02). The images generated for each caption appear to the left or above the text. Voxels are connected to their corresponding captions and images by lines. The color of each caption and image border reflects the average UMAP color of all voxels in the cluster.
  • ...and 15 more figures
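
Steps (a) and (b) of the Figure 3 caption (a ridge-regression encoding model, followed by top-N image selection via an inner product with the voxel's encoding weight) can be illustrated with a minimal sketch. This is not the authors' implementation: the random arrays stand in for CLIP-Vision features and fMRI responses, and the function names (`fit_ridge`, `top_n_images`), the regularization strength `alpha`, and all dimensions are assumptions chosen for illustration only.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y.

    X: (n_images, n_features) image features (hypothetical stand-in for CLIP-Vision latents).
    Y: (n_images, n_voxels) voxel responses.
    Returns W with shape (n_features, n_voxels); column v is the encoding weight of voxel v.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def top_n_images(W, external_feats, voxel, n=5):
    """Rank external images by predicted activation (inner product) for one voxel."""
    scores = external_feats @ W[:, voxel]   # predicted activation per external image
    order = np.argsort(scores)[::-1][:n]    # indices of the n highest scores
    return order, scores[order]


# Synthetic data (assumption: real features and responses would replace these).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 16))
W_true = rng.standard_normal((16, 3))
Y_train = X_train @ W_true + 0.1 * rng.standard_normal((200, 3))

W = fit_ridge(X_train, Y_train, alpha=1.0)

# Features of an external image pool; the top-n rows form the "optimal image set".
external = rng.standard_normal((1000, 16))
idx, scores = top_n_images(W, external, voxel=0, n=5)
```

In this sketch, `idx` plays the role of the optimal image set for voxel 0; in the pipeline described above, those images would then be captioned by an MLLM and summarized into a voxel caption, steps that are not modeled here.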