Exploring The Visual Feature Space for Multimodal Neural Decoding

Weihao Xia, Cengiz Oztireli

TL;DR

The paper tackles fine-grained brain decoding of visual content by proposing VINDEX, a zero-shot framework that aligns fMRI signals with multiple pre-trained vision feature spaces inside multimodal LLMs. It introduces a denoiser-based alignment objective to robustly map brain activations to image features, enabling decoding across granularities without extra textual or spatial annotations. To evaluate fine-grained perceptual understanding, the authors propose MG-BrainDub, a benchmark focusing on detailed captions and salient question-answering, with element-level metrics for objects, attributes, and relations. Empirical results across concept localization, concise and descriptive captioning, and complex reasoning indicate that selective feature spaces (notably Nested Features with 9 tokens) balance detail and brain constraints, yielding strong zero-shot performance and cross-subject generalization. The work advances zero-shot multimodal brain decoding and provides a practical evaluation framework for multi-granular neuro-visual understanding using foundation vision models.

Abstract

The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at https://github.com/weihaox/VINDEX.

Paper Structure

This paper contains 44 sections, 7 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Feature Spaces. We aim to capture multi-granular representations for zero-shot multimodal brain decoding by aligning brain signals across different feature spaces in an MLLM’s visual component. The explored representative feature spaces include: (a) Single Encoder, which utilizes a single encoder to extract selected features; (b) Mixture of Encoders, using hybrid features from different vision experts specialized in task-specific domains; (c) Aggregated Feature, fusing features from different layers within the same image encoder; and (d) Nested Features, downscaling visual representations in a hierarchical coarse-to-fine structure with different perceptual granularities.
  • Figure 2: Training Procedure with Denoiser.
  • Figure 3: Method Overview. Once a pre-trained brain encoder is available, brain signals can be input to obtain predicted brain features. These brain features are then fed into the connector and LLM for multimodal brain interaction. Our method follows a similar overview to UMBRAE [xia2024umbrae] but differentiates itself through the use of different vision encoders, alignment strategies, and support for additional tasks.
  • Figure 4: Example NSD Images. Below, we present detailed captions generated by MLLMs using images as input, alongside captions from our method using different feature spaces with brain signals as input.
  • Figure 5: Denoiser as a Training Stabilizer. (a) The vanilla regression loss decreases but exhibits significant oscillation; (b) The training process becomes less oscillatory with the incorporation of diffusion loss, which stabilizes the training and accelerates convergence.
  • ...and 1 more figure
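
The Nested Features idea in Figure 1(d), where visual tokens are downscaled into a coarse-to-fine hierarchy, can be illustrated with a small sketch. This is not the authors' implementation: it assumes the downscaling is plain average pooling over a square token grid, and the 24×24 grid, the feature width of 16, and the function name `nested_tokens` are hypothetical choices made only for the example. Under those assumptions, the 3×3 level yields the 9 tokens mentioned in the TL;DR.

```python
import numpy as np


def nested_tokens(tokens: np.ndarray, grid: int, out_grid: int) -> np.ndarray:
    """Average-pool a (grid*grid, dim) token sequence to (out_grid*out_grid, dim).

    Tokens are assumed to be in row-major order over a square grid.
    """
    n, dim = tokens.shape
    if n != grid * grid:
        raise ValueError("token count must equal grid * grid")
    if grid % out_grid != 0:
        raise ValueError("out_grid must divide grid evenly")
    k = grid // out_grid  # side length of each pooling window
    # Split rows and columns into (block index, offset inside block).
    blocks = tokens.reshape(out_grid, k, out_grid, k, dim)
    pooled = blocks.mean(axis=(1, 3))  # average inside each k x k window
    return pooled.reshape(out_grid * out_grid, dim)


# Hypothetical input: a 24 x 24 grid of 16-dimensional visual tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((24 * 24, 16))

# Coarse-to-fine hierarchy keyed by token count: 1, 4, 9, 36, 144, 576.
levels = {g * g: nested_tokens(tokens, 24, g) for g in (1, 2, 3, 6, 12, 24)}
for count, feats in levels.items():
    print(count, feats.shape)
```

In a setup like this, a brain encoder would be trained to predict one chosen level (for example the 9-token level) rather than the full grid, which trades spatial detail for a smaller regression target.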