Multigranular Evaluation for Brain Visual Decoding

Weihao Xia, Cengiz Oztireli

TL;DR

This work addresses the lack of discriminative, neuroscience-grounded evaluation in brain visual decoding by introducing BASIC, a multigranular framework that jointly measures structural fidelity, inferential alignment, and contextual coherence between decoded and ground-truth images. Evaluation is split into two parts: BASIC-L (low-level structural similarity over salient, semantic, instance, and part masks), computed with mask-based segmentation, and BASIC-H (high-level semantic similarity over objects, attributes, relations, and scene context), computed from LLM-driven semantic representations. The framework combines a three-step semantic matching procedure with Grounded-SAM2 segmentation to provide interpretable diagnostics, and it is model-agnostic and robust across multiple datasets and modalities. By benchmarking diverse decoding methods under a unified protocol, BASIC discriminates more finely between models, reveals trade-offs between semantic and structural quality, and establishes a scalable, open benchmark for brain-to-vision research.
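
To make the BASIC-H idea concrete, the sketch below compares two structured scene descriptions field by field. It is an illustration only, not the paper's method: it assumes each image has already been converted into sets of objects, attributes, and relations, and it uses exact string matching with an F1 score as a simple stand-in for the three-step semantic matching mentioned above. The names `set_f1` and `semantic_score` are hypothetical.

```python
from typing import Dict, Set


def set_f1(pred: Set[str], target: Set[str]) -> float:
    """F1 overlap between two sets of labels (exact string match)."""
    if not pred and not target:
        # Assumption: two empty descriptions count as a perfect match.
        return 1.0
    true_pos = len(pred & target)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(target)
    return 2 * precision * recall / (precision + recall)


def semantic_score(pred: Dict[str, Set[str]],
                   target: Dict[str, Set[str]]) -> Dict[str, float]:
    """Per-field F1 for objects, attributes, and relations."""
    fields = ["objects", "attributes", "relations"]
    return {f: set_f1(pred[f], target[f]) for f in fields}


# Hypothetical structured descriptions of a decoded and a ground-truth image.
decoded = {
    "objects": {"dog", "ball"},
    "attributes": {"dog:brown"},
    "relations": {"dog-chasing-ball"},
}
ground_truth = {
    "objects": {"dog", "frisbee", "grass"},
    "attributes": {"dog:brown", "grass:green"},
    "relations": {"dog-catching-frisbee"},
}
scores = semantic_score(decoded, ground_truth)
```

Reporting one score per field, rather than a single number, is what allows this kind of evaluation to show whether a reconstruction fails on object identity, on attributes, or on relations.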

Abstract

Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground-truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for evaluating brain visual decoding methods.
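
As a rough illustration of the structural level, the sketch below scores a decoded image against its ground truth at each mask granularity. It rests on several assumptions that the abstract does not confirm: one binary mask per granularity is already available for both images, intersection-over-union (IoU) serves as the similarity measure, and the levels are averaged without weights. The names `mask_iou` and `structural_score` are hypothetical, and the paper's granularity-aware correspondence may differ.

```python
from typing import Dict, List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two same-shaped binary masks."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def structural_score(pred: Dict[str, Mask],
                     target: Dict[str, Mask]) -> Dict[str, float]:
    """Per-granularity IoU plus an unweighted mean over the four levels."""
    levels = ["salient", "semantic", "instance", "part"]
    scores = {lv: mask_iou(pred[lv], target[lv]) for lv in levels}
    scores["mean"] = sum(scores[lv] for lv in levels) / len(levels)
    return scores


# Toy 2x2 masks: the decoded salient mask covers one extra pixel.
gt_mask = [[1, 0], [0, 0]]
decoded_masks = {
    "salient": [[1, 1], [0, 0]],
    "semantic": gt_mask,
    "instance": gt_mask,
    "part": gt_mask,
}
gt_masks = {lv: gt_mask for lv in decoded_masks}
result = structural_score(decoded_masks, gt_masks)
```

A real pipeline would also need to match multiple instance and part masks between the two images before scoring; that step is omitted here.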

Paper Structure

This paper contains 56 sections, 1 equation, 8 figures, 16 tables.

Figures (8)

  • Figure 1: BASIC evaluates decoded reconstructions along two axes: high-level semantic (BASIC-H) and low-level structural (BASIC-L) similarities. For the semantic axis (inferential and contextual), we extract and compare structured representations from reconstructed and ground-truth images. For the structural axis, we compute mask-based matching across fine-grained segmentation types of identified scenes and objects: salient, semantic, instance, and parts.
  • Figure 2: BASIC performance.
  • Figure 3: BASIC demonstrates stable and consistent performance in method evaluation across variations in (a) MLLMs [liu2023visual], (b) prompting strategies, and (c) thresholds for box and text [ren2024grounded].
  • Figure 4: Qualitative examples with BASIC-H and BASIC-L scores, including sub-indicators.
  • Figure S1: The visualization of multigranular segmentation. Reference images and structured annotations can be found in \ref{fig:supmat_nsd_recon} and \ref{tab:supmat_sam_label}, respectively.
  • ...and 3 more figures