Evaluation of Machine-generated Biomedical Images via A Tally-based Similarity Measure
Frank J. Brooks, Rucha Deshpande
TL;DR
This paper addresses the problem of evaluating synthetic biomedical images when ground-truth references are unavailable. It introduces the Weighted Similarity Index (WSI), a framework based on the Tversky index that binarizes diverse image features via tolerance intervals and weights them to compute a similarity score bounded between 0 and 1. Using real and simulated data (including CoBaLT and WonoST stochastic models, CheXpert reconstructions, grayscale analyses, and Img2Vec baselines), the authors demonstrate that the tally-based WSI captures intuitive similarities and detects deficiencies that distance-based metrics often miss, particularly in ablation, perturbation, and reconstruction scenarios. The approach emphasizes task-relevant feature selection, interpretability, and robustness; the authors also note practical caveats and suggest that self-similarity analyses may indicate hallucination. Overall, the work offers a principled, interpretable alternative to pixel-wise or feature-space distances for evaluating the quality of generated biomedical images, and it encourages feature design aligned with clinical or domain-specific utility.
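The tally-based idea summarized above can be illustrated with a minimal sketch. This is not the paper's WSI definition: the function names `binarize` and `weighted_tversky`, the tolerance intervals, the weights, and the default `alpha` and `beta` values are all hypothetical and chosen only for illustration. The sketch marks each feature as present when its value falls inside a tolerance interval, then computes a weighted Tversky ratio over the resulting binary vectors.

```python
import numpy as np

def binarize(values, lower, upper):
    """Return 1 where a feature value lies inside its tolerance interval, else 0."""
    values = np.asarray(values, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return ((values >= lower) & (values <= upper)).astype(int)

def weighted_tversky(a, b, weights, alpha=0.5, beta=0.5):
    """Weighted Tversky ratio over two binary feature vectors (illustrative only)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    w = np.asarray(weights, dtype=float)
    common = w[a & b].sum()    # weighted tally of features present in both
    only_a = w[a & ~b].sum()   # weighted tally of features present only in a
    only_b = w[~a & b].sum()   # weighted tally of features present only in b
    denom = common + alpha * only_a + beta * only_b
    # Assumption: two images with no features present score 0 rather than raising.
    return float(common / denom) if denom > 0 else 0.0

# Hypothetical tolerance intervals for three features, plus made-up weights.
lower = [0.2, 10.0, 0.9]
upper = [0.4, 20.0, 1.1]
weights = [2.0, 1.0, 1.0]

reference = binarize([0.30, 15.0, 1.00], lower, upper)  # all three inside
generated = binarize([0.35, 25.0, 1.05], lower, upper)  # second one outside

score = weighted_tversky(reference, generated, weights)
print(score)  # 3 / 3.5, about 0.857
```

Because the numerator also appears in the denominator and all weights are non-negative, the score stays between 0 and 1, which matches the bounded behavior described above.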
Abstract
Super-resolution, in-painting, whole-image generation, unpaired style-transfer, and network-constrained image reconstruction each include an aspect of machine-learned image synthesis where the actual ground truth is not known at the time of use. It is generally difficult to evaluate the quality of synthetic images quantitatively and authoritatively; however, in mission-critical biomedical scenarios, robust evaluation is paramount. This work argues that all practical image-to-image comparisons are relative qualifications rather than absolute difference quantifications, and that meaningful evaluation of generated image quality can therefore be accomplished using the Tversky Index, a well-established measure for assessing perceptual similarity. This evaluation procedure is developed and then demonstrated using multiple image data sets, both real and simulated. The main result is that when the subjectivity and intrinsic deficiencies of any feature-encoding choice are stated up front, Tversky's method leads to intuitive results, whereas traditional methods based on summarizing distances in deep feature spaces do not.
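For reference, the Tversky Index is commonly written as the ratio below. The symbols are standard notation rather than definitions taken from this abstract: A and B are the feature sets of the two items being compared, f is a non-negative measure of a feature set (for example, a count or a weighted tally), and α and β weight the features unique to each item. The paper's exact weighting scheme may differ.

```latex
S(A, B) = \frac{f(A \cap B)}{f(A \cap B) + \alpha\, f(A \setminus B) + \beta\, f(B \setminus A)},
\qquad \alpha, \beta \ge 0
```

With α = β = 1 this reduces to the Jaccard index, and with α = β = 1/2 to the Dice coefficient. In every case the value lies in [0, 1] whenever the denominator is non-zero.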
