SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding

Juhyeon Park, Peter Yongho Kim, Jiook Cha, Shinjae Yoo, Taesup Moon

TL;DR

This work targets semantic evaluation for visual brain decoding, revealing that existing metrics poorly align with human judgments. It introduces SEED, a composite metric that blends Object F1, Cap-Sim, and a correlation-based EffNet term to better capture image semantics, and couples it with a novel pairwise hinge loss to improve semantic alignment during model training. Empirical results on NSD show SEED achieves the strongest agreement with human judgments and exposes frequent semantic near-misses in state-of-the-art reconstructions. The authors also open-source human ratings and provide a practical training loss, offering a pathway to more faithful semantic brain decoding and evaluation in future work.
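
As a rough illustration of the composite idea, the sketch below computes an object-level F1 between two sets of object labels and then blends three similarity scores. The function names `object_f1` and `blend`, the use of plain label sets, the handling of empty sets, and the unweighted mean are all assumptions made for illustration; this summary does not state how the paper detects objects or how it weights the three terms.

```python
def object_f1(gt_objects, recon_objects):
    """F1 score between the object labels of a ground-truth image and a reconstruction."""
    gt, recon = set(gt_objects), set(recon_objects)
    if not gt and not recon:
        return 1.0  # assumption: two empty label sets count as a perfect match
    true_positives = len(gt & recon)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(recon)
    recall = true_positives / len(gt)
    return 2 * precision * recall / (precision + recall)


def blend(object_f1_score, cap_sim, effnet_sim):
    """Hypothetical blend: unweighted mean of three scores where higher means more similar."""
    return (object_f1_score + cap_sim + effnet_sim) / 3.0


# Two of three labels overlap on each side, so precision = recall = 2/3.
score = object_f1({"dog", "frisbee", "person"}, {"dog", "person", "bench"})
```

The blend assumes all three inputs are already oriented so that larger values mean greater similarity; a distance-style term would need to be converted first.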

Abstract

We present SEED (Semantic Evaluation for Visual Brain Decoding), a novel metric for evaluating the semantic decoding performance of visual brain decoding models. It integrates three complementary metrics, each capturing a different aspect of semantic similarity between images. Using carefully crowd-sourced human judgment data, we demonstrate that SEED achieves the highest alignment with human evaluations, outperforming other widely used metrics. Through the evaluation of existing visual brain decoding models, we further reveal that crucial information is often lost in translation, even in state-of-the-art models that achieve near-perfect scores on existing metrics. To facilitate further research, we open-source the human judgment data, encouraging the development of more advanced evaluation methods for brain decoding models. Additionally, we propose a novel loss function designed to enhance semantic decoding performance by leveraging the order of pairwise cosine similarity in CLIP image embeddings. This loss function is compatible with various existing methods and has been shown to consistently improve their semantic decoding performances when used for training, with respect to both existing metrics and SEED.
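
One way to read "leveraging the order of pairwise cosine similarity" is as a ranking constraint: whenever sample i is closer to j than to k among the target embeddings, the predicted embeddings should preserve that order by some margin. The sketch below is a minimal pure-Python illustration of that reading, not the paper's formulation. The function name `pairwise_order_hinge`, the `margin` value, the choice to compare similarities among predictions (rather than between predictions and targets), and the averaging over triples are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_order_hinge(pred, target, margin=0.1):
    """Hinge penalty when predicted embeddings break the target similarity ordering."""
    n = len(target)
    # Precompute both pairwise cosine similarity matrices.
    sim_t = [[cosine(target[i], target[j]) for j in range(n)] for i in range(n)]
    sim_p = [[cosine(pred[i], pred[j]) for j in range(n)] for i in range(n)]
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if i == j or i == k or j == k:
                    continue
                if sim_t[i][j] > sim_t[i][k]:
                    # Target says j is closer to i than k is; ask predictions to agree.
                    total += max(0.0, margin - (sim_p[i][j] - sim_p[i][k]))
                    count += 1
    return total / count if count else 0.0


target = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
swapped = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # rows 1 and 2 exchanged
loss_same = pairwise_order_hinge(target, target, margin=0.0)
loss_swapped = pairwise_order_hinge(swapped, target, margin=0.0)
```

For actual training, the same logic would need to be written in a differentiable framework and vectorized; the triple loop here is only meant to make the ordering constraint explicit.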

Paper Structure

This paper contains 33 sections, 8 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Current evaluation metrics assess the semantic similarity between ground-truth and reconstructions in a way that significantly differs from human judgment, often giving relatively high scores to reconstructions that are semantically misaligned.
  • Figure 2: The overall process for calculating SEED.
  • Figure 3: The heatmap of correlations between metric combinations and human evaluation, measured by Kendall's Tau-b. The green outline indicates combinations within current metrics.
  • Figure 4: Examples and rankings (out of 1000 pairs) of worst-case judgments for (a) Object F1, (b) Cap-Sim, and (c) EffNet.
  • Figure 5: Examples of the semantic near-miss phenomenon.
  • ...and 6 more figures
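
The Figure 3 caption reports agreement with human evaluation using Kendall's Tau-b, a rank correlation that corrects for ties. As a hedged sketch of how such a number can be computed, the pure-Python function below applies the standard Tau-b definition to two score lists. The function name `kendall_tau_b` and the example scores are illustrative and are not taken from the paper's data.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's Tau-b between two equal-length score lists (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both denominator terms
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Illustrative values only: a metric's scores against human ratings for four image pairs.
metric_scores = [1, 2, 2, 3]
human_ratings = [1, 2, 3, 4]
tau = kendall_tau_b(metric_scores, human_ratings)
```

Here five pairs are concordant and one is tied only in the metric scores, so the result is 5 / sqrt(30), slightly below 1 because of the tie.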