FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI
Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz
TL;DR
The paper tackles the scarcity of ground-truth reasoning annotations in medical image AI by introducing FunnyNodules, a fully parameterized synthetic dataset with six controllable visual attributes, an attribute-based target rule, and per-sample ROI masks. This framework enables systematic, model-agnostic evaluation of attribute reasoning, explanation trustworthiness, and prototype-based explanations under controlled, scalable conditions. It defines metrics such as Within-1-Accuracy and a Trust Index to quantify the alignment between target predictions and attribute explanations, and demonstrates how attribute-level attention and prototypes can be assessed. While not a substitute for real patient data, FunnyNodules provides a versatile, reproducible platform for studying reasoning, robustness, and explanation quality in medical AI, and it can be extended to explore background effects, class imbalance, and model-specific behaviors.
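To make the Within-1-Accuracy metric concrete, here is a minimal sketch. It assumes the definition commonly used for ordinal nodule ratings: a prediction counts as correct when it differs from the ground-truth score by at most one step. This section does not spell out the paper's exact definition, so the function name `within_1_accuracy` and the 1-5 rating scale are assumptions. The Trust Index is not sketched because its formula is not given here.

```python
import numpy as np


def within_1_accuracy(y_true, y_pred):
    """Fraction of predictions at most one ordinal step from the true label.

    Assumed definition; the paper's exact formulation may differ.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(np.abs(y_true - y_pred) <= 1))


# Hypothetical ordinal attribute scores on a 1-5 scale.
true_scores = [1, 3, 5, 2, 4]
pred_scores = [2, 3, 3, 2, 5]
score = within_1_accuracy(true_scores, pred_scores)
print(f"Within-1-Accuracy: {score:.2f}")
```

In this toy example, four of the five predictions fall within one step of the true score; only the third (5 versus 3) is counted as an error.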
Abstract
Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung-nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. The target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and other dataset properties. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
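The idea of deriving the target class from a predefined attribute combination can be illustrated with a small sketch. Everything in it is hypothetical: the `NoduleAttributes` container, the 1-5 attribute scales, and the scoring rule in `target_rule` are illustrative assumptions, not the dataset's actual parameters or decision rule. Only three of the attributes named in the abstract are included.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NoduleAttributes:
    """Hypothetical attribute vector; scales are assumed, not from the paper."""
    roundness: int         # 1 (irregular) .. 5 (round)
    margin_sharpness: int  # 1 (blurred) .. 5 (sharp)
    spiculation: int       # 1 (none) .. 5 (marked)


def sample_attributes(rng):
    """Draw each attribute independently and uniformly from 1..5."""
    return NoduleAttributes(
        roundness=rng.randint(1, 5),
        margin_sharpness=rng.randint(1, 5),
        spiculation=rng.randint(1, 5),
    )


def target_rule(attrs):
    """Illustrative decision rule mapping attributes to an ordinal class 1..5.

    Assumption for this sketch: more spiculation, less roundness, and less
    sharp margins all raise the target class.
    """
    suspicion = (
        attrs.spiculation
        + (6 - attrs.roundness)
        + (6 - attrs.margin_sharpness)
    )  # ranges from 3 to 15
    return 1 + (suspicion - 3) * 4 // 12


rng = random.Random(0)
samples = [sample_attributes(rng) for _ in range(1000)]
targets = [target_rule(s) for s in samples]
```

Because the rule is an explicit function of the attributes, every generated sample carries a fully known attribute-to-target relation, which is what allows a model's learned relations to be checked against ground truth.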
