Exploring The Visual Feature Space for Multimodal Neural Decoding
Weihao Xia, Cengiz Oztireli
TL;DR
The paper tackles the challenge of fine-grained brain decoding for visual content by proposing VINDEX, a zero-shot framework that aligns fMRI signals with multiple pre-trained vision feature spaces inside multimodal LLMs. It introduces a denoiser-based alignment objective to robustly map brain activations to image features, enabling decoding across granularities without extra textual or spatial annotations. To evaluate fine-grained perceptual understanding, the authors propose MG-BrainDub, a benchmark focusing on detailed captions and salient question-answering with element-level metrics for objects, attributes, and relations. Empirical results across concept localization, concise and descriptive captioning, and complex reasoning indicate that selected feature spaces (notably Nested Features with 9 tokens) balance visual detail against the limits of what brain signals can encode, yielding strong zero-shot performance and cross-subject generalization. The work advances zero-shot multimodal brain decoding and provides a practical evaluation framework for multi-granular neuro-visual understanding using foundation vision models.
Abstract
The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at https://github.com/weihaox/VINDEX.
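
As a rough illustration of what an element-level metric over objects, attributes, and relationships could look like, the sketch below scores a decoded description against a reference by set overlap within each category. This is an assumption made for illustration only, not the benchmark's actual scoring procedure: the function names `element_scores` and `score_by_category`, the exact-match rule, and the example elements are all hypothetical, and how elements would be extracted from free-form text is left out.

```python
from typing import Dict, Set


def element_scores(predicted: Set[str], reference: Set[str]) -> Dict[str, float]:
    """Precision, recall, and F1 for one category of visual elements.

    Hypothetical helper: elements count as matched only on exact string equality.
    """
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def score_by_category(
    predicted: Dict[str, Set[str]], reference: Dict[str, Set[str]]
) -> Dict[str, Dict[str, float]]:
    """Score each reference category (e.g. objects, attributes, relations) separately."""
    return {
        category: element_scores(predicted.get(category, set()), ref_elements)
        for category, ref_elements in reference.items()
    }


# Made-up example elements, not taken from any dataset.
reference = {
    "objects": {"dog", "frisbee"},
    "attributes": {"brown dog"},
    "relations": {"dog catches frisbee"},
}
predicted = {
    "objects": {"dog", "ball"},
    "attributes": {"brown dog"},
    "relations": set(),
}
scores = score_by_category(predicted, reference)
```

Reporting the three categories separately, rather than as one pooled score, makes it visible when a decoder recovers the right objects but misses how they relate to each other.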
