NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin I. Bercea, Jun Li, Philipp Raffler, Evamaria O. Riedel, Lena Schmitzer, Angela Kurz, Felix Bitzer, Paula Roßmüller, Julian Canisius, Mirjam L. Beyrle, Che Liu, Wenjia Bai, Bernhard Kainz, Julia A. Schnabel, Benedikt Wiestler

TL;DR

NOVA targets open-world generalization in medical imaging by providing a real-world, evaluation-only brain MRI benchmark that couples anomaly localization, clinical captioning, and diagnostic reasoning across 281 rare pathologies. It combines 906 cases with expert bounding boxes, radiologist-provided captions, and clinical histories to rigorously test vision-language models and large language models under distribution shifts. Experimental results reveal substantial drops in localization, wording precision, and diagnostic accuracy, highlighting critical gaps in current models for rare disease detection and multimodal clinical reasoning. By offering a high-fidelity, heterogeneous, and clinically grounded stress test, NOVA aims to accelerate the development of robust multimodal systems capable of detecting unknown abnormalities in real-world brain MRI practice.

Abstract

In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life, evaluation-only benchmark of ~900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.

Paper Structure

This paper contains 16 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the NOVA benchmark. Task 1: Anomaly localization: models predict bounding boxes identifying abnormal regions in brain MRI; ground truth annotations from two independent radiologists are shown. Task 2: Image captioning: models generate a brief diagnostic description from the MRI image. Task 3: Diagnostic reasoning: models predict the final diagnosis by integrating clinical history and image findings. NOVA establishes the first benchmark designed to systematically evaluate vision-language models (VLMs) and large language models (LLMs) for rare anomaly localization, clinical description, and multimodal diagnostic reasoning in brain MRI.
  • Figure 2: Representative brain MRI scans from the NOVA dataset illustrating the diversity of anatomical planes, MRI sequences, and pathological conditions. Radiologist-provided bounding box annotations are overlaid. The examples include rare congenital malformations, toxic and metabolic encephalopathies, and inflammatory or neoplastic lesions—capturing the broad radiological spectrum.
  • Figure 3: Dataset composition and annotation quality in NOVA. (a) Distribution of cases across six diagnostic categories. (b) Inter-rater agreement as mean intersection over union (IoU) between radiologist pairs. (c) Histogram of IoU scores across all scans.
  • Figure 3: Diagnostic reasoning results on NOVA. Diagnostic accuracy is captured by the Top-1 and Top-5 accuracy. Coverage and entropy are extracted from diagnostic reasoning distributions.
  • Figure 4: Examples of model predictions for anomaly grounding on NOVA. Ground truth and model-predicted bounding boxes are shown for Gemini 2.0 Flash, Qwen2.0-VL-72B, and Qwen2.5-VL-72B.
  • ...and 2 more figures
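
The Figure 3 caption reports inter-rater agreement as intersection over union (IoU) between radiologist bounding boxes. As a rough illustration of that quantity, the sketch below computes IoU for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format and the function name `box_iou` are assumptions made for this example; the source does not specify how NOVA stores annotations or which evaluation code the authors use.

```python
from typing import Tuple

# Assumed box format (not taken from the source): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if any
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so that disjoint boxes yield no intersection area
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    rater_1 = (0.0, 0.0, 2.0, 2.0)
    rater_2 = (1.0, 1.0, 3.0, 3.0)
    print(box_iou(rater_1, rater_2))  # overlap area 1, union area 7
```

The same function could in principle score a model-predicted box against a ground-truth box; how multiple boxes per scan are matched and averaged is left open here, since the source does not describe it.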