Multigranular Evaluation for Brain Visual Decoding
Weihao Xia, Cengiz Oztireli
TL;DR
This work addresses the lack of discriminative, neuroscience-grounded evaluation in brain visual decoding by introducing BASIC, a multigranular framework that jointly measures structural fidelity, inferential alignment, and contextual coherence between decoded and ground-truth images. It splits evaluation into BASIC-L (low-level structure, assessed with salient, semantic, instance, and part masks) and BASIC-H (high-level semantics, assessed over objects, attributes, relations, and scene context), built respectively on mask-based segmentation and on semantic representations extracted by large language models. The framework combines Grounded-SAM2 segmentation with a three-step semantic matching procedure to provide interpretable diagnostics, proves robust across multiple datasets and modalities, and is model-agnostic. By benchmarking diverse decoding methods under a unified protocol, BASIC discriminates more finely between models, reveals trade-offs between semantic and structural fidelity, and establishes a scalable, open benchmark for brain-to-vision research.
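The structural side (BASIC-L) can be illustrated with a minimal sketch that scores decoded masks against ground-truth masks separately at each granularity. It assumes the masks already exist (for example, produced by Grounded-SAM2) as boolean NumPy arrays. The function names, the greedy one-to-one matching (a simplification of optimal Hungarian assignment), and the normalization by the larger mask count are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

# Hypothetical helpers illustrating mask-based structural scoring;
# this is not the BASIC implementation.

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks with the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as agreement (an assumption)
    return float(np.logical_and(a, b).sum() / union)


def matched_mask_score(pred: list, gt: list) -> float:
    """Greedily pair predicted and ground-truth masks one-to-one by IoU.

    The summed IoU is divided by the larger of the two mask counts, so both
    missed and spurious masks lower the score (an assumed normalization).
    """
    if len(pred) == 0 or len(gt) == 0:
        return 1.0 if len(pred) == len(gt) else 0.0
    ious = np.array([[mask_iou(p, g) for g in gt] for p in pred])  # shape (P, G)
    used_pred, used_gt, total = set(), set(), 0.0
    # Visit candidate pairs from highest to lowest IoU.
    for flat in np.argsort(-ious, axis=None):
        i, j = (int(k) for k in np.unravel_index(flat, ious.shape))
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        total += ious[i, j]
    return float(total / max(len(pred), len(gt)))


def multigranular_scores(pred_levels: dict, gt_levels: dict) -> dict:
    """Score each granularity level (e.g. salient, semantic, instance, part) separately."""
    return {level: matched_mask_score(pred_levels.get(level, []), gt_masks)
            for level, gt_masks in gt_levels.items()}


if __name__ == "__main__":
    gt = np.zeros((4, 4), dtype=bool)
    gt[:2, :2] = True    # 4 pixels
    pred = np.zeros((4, 4), dtype=bool)
    pred[:2, :3] = True  # 6 pixels, 4 of them shared with gt
    # salient: IoU 4/6; part: no predicted part masks, so 0.0
    print(multigranular_scores({"salient": [pred]}, {"salient": [gt], "part": [gt]}))
```

Keeping one score per level, rather than a single pooled number, is what lets a decoder that recovers the salient region but misses object parts be told apart from one that fails everywhere.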
Abstract
Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack a neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground-truth images. At the structural level, we propose a hierarchical suite of segmentation-based metrics over foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. At the semantic level, we use multimodal large language models to extract structured scene representations encompassing objects, attributes, and relationships, enabling detailed, scalable, and context-rich comparisons with the ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for evaluating brain visual decoding methods.
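The semantic-level comparison can be sketched in the same hedged spirit. Assuming each image has already been converted (for example, by a multimodal large language model) into a set of objects, a set of (object, attribute) pairs, and a set of (subject, predicate, object) relations, the code below scores a decoded image against its ground truth with precision, recall, and F1 per element type. The `SceneGraph` class and all function names are hypothetical, and exact string matching after lowercasing stands in for BASIC's three-step semantic matching, so synonyms such as "puppy" and "dog" count as misses here.

```python
from dataclasses import dataclass, field

# Hypothetical structures for illustration; BASIC-H's own representation
# and its three-step matching are not reproduced here.

@dataclass
class SceneGraph:
    objects: set = field(default_factory=set)     # e.g. {"dog"}
    attributes: set = field(default_factory=set)  # (object, attribute) pairs
    relations: set = field(default_factory=set)   # (subject, predicate, object) triplets


def _normalize(items: set) -> set:
    """Lowercase and strip every string, including strings inside tuples."""
    def norm(s: str) -> str:
        return s.strip().lower()
    return {tuple(norm(s) for s in it) if isinstance(it, tuple) else norm(it)
            for it in items}


def precision_recall_f1(pred: set, gt: set) -> tuple:
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


def compare_scenes(decoded: SceneGraph, truth: SceneGraph) -> dict:
    """Exact-match comparison per element type (a stand-in for semantic matching)."""
    return {
        name: precision_recall_f1(_normalize(getattr(decoded, name)),
                                  _normalize(getattr(truth, name)))
        for name in ("objects", "attributes", "relations")
    }


if __name__ == "__main__":
    truth = SceneGraph({"dog", "ball", "grass"},
                       {("dog", "brown"), ("ball", "red")},
                       {("dog", "chasing", "ball"), ("dog", "on", "grass")})
    decoded = SceneGraph({"Dog", "ball"}, {("dog", "black")}, {("dog", "chasing", "ball")})
    for name, (p, r, f) in compare_scenes(decoded, truth).items():
        print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
    # objects: P=1.00 R=0.67 F1=0.80
    # attributes: P=0.00 R=0.00 F1=0.00
    # relations: P=1.00 R=0.50 F1=0.67
```

Precision penalizes content the decoded image hallucinates, while recall penalizes content it omits; reporting both per element type shows, for instance, that a decoder can recover the right objects while getting their attributes wrong.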
