Evaluation of Machine-generated Biomedical Images via A Tally-based Similarity Measure

Frank J. Brooks, Rucha Deshpande

TL;DR

This paper tackles the problem of evaluating synthetic biomedical images when ground-truth references are unavailable. It introduces a Tversky-index–based framework, the Weighted Similarity Index (WSI), which binarizes diverse image features via tolerance intervals and weights them to compute a bounded similarity score between 0 and 1. Through real and simulated data (including CoBaLT and WonoST stochastic models, CheXpert reconstructions, grayscale analyses, and Img2Vec baselines), the authors demonstrate that the tally-based WSI captures intuitive similarities and detects deficiencies that distance-based metrics often miss, particularly in ablation, perturbation, and reconstruction scenarios. The approach emphasizes task-relevant feature selection, interpretability, and robustness, with practical caveats and potential indications of hallucination via self-similarity analyses. Overall, the work provides a principled, interpretable alternative to pixel-wise or feature-space distances for evaluating generative biomedical image quality and encourages careful feature design aligned with clinical or domain-specific utility.
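The tally idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`binarize`, `weighted_tversky`), the tolerance intervals, the weights, and the choice `alpha = beta = 0.5` are all assumptions made for illustration. A feature counts as "present" in an image when its measured value falls inside a tolerance interval, and the score is a weighted Tversky ratio of shared to distinctive features.

```python
import numpy as np

def binarize(values, intervals):
    """Return a boolean vector: True where a feature value lies inside
    its (hypothetical) tolerance interval [low, high]."""
    values = np.asarray(values, dtype=float)
    intervals = np.asarray(intervals, dtype=float)
    return (values >= intervals[:, 0]) & (values <= intervals[:, 1])

def weighted_tversky(a, b, weights, alpha=0.5, beta=0.5):
    """Weighted Tversky ratio between two binary feature vectors.

    common / (common + alpha * only_a + beta * only_b), where each term
    is the summed weight of the features in that category. With
    non-negative weights and alpha, beta >= 0 the result lies in [0, 1].
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    w = np.asarray(weights, dtype=float)
    common = w[a & b].sum()    # features present in both images
    only_a = w[a & ~b].sum()   # features present only in image A
    only_b = w[~a & b].sum()   # features present only in image B
    denom = common + alpha * only_a + beta * only_b
    if denom == 0.0:
        # Assumption: if neither image shows any feature, report 0.
        return 0.0
    return float(common / denom)

# Hypothetical feature values, tolerance intervals, and weights.
intervals = [[0.0, 1.0], [10.0, 20.0], [5.0, 6.0]]
weights = [2.0, 1.0, 1.0]
feat_a = binarize([0.5, 15.0, 7.0], intervals)   # [True, True, False]
feat_b = binarize([0.2, 25.0, 5.5], intervals)   # [True, False, True]
score = weighted_tversky(feat_a, feat_b, weights)
print(score)
```

In this toy case the shared weight is 2 and each image carries one distinctive feature of weight 1, so the score is 2 / (2 + 0.5 + 0.5) = 2/3. Comparing an image with itself gives 1 whenever at least one feature is present.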

Abstract

Super-resolution, in-painting, whole-image generation, unpaired style-transfer, and network-constrained image reconstruction each include an aspect of machine-learned image synthesis where the actual ground truth is not known at time of use. It is generally difficult to quantitatively and authoritatively evaluate the quality of synthetic images; however, in mission-critical biomedical scenarios robust evaluation is paramount. In this work, it is argued that all practical image-to-image comparisons are relative qualifications, not absolute difference quantifications; therefore, meaningful evaluation of generated image quality can be accomplished using the Tversky Index, which is a well-established measure for assessing perceptual similarity. This evaluation procedure is developed and then demonstrated using multiple image data sets, both real and simulated. The main result is that when the subjectivity and intrinsic deficiencies of any feature-encoding choice are stated up front, Tversky's method leads to intuitive results, whereas traditional methods based on summarizing distances in deep feature spaces do not.
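For reference, the Tversky Index named in the abstract is usually written in its ratio-model form, shown here in generic notation that may differ from the symbols used in the paper. $A$ and $B$ are the feature sets of the two objects being compared, $f$ is a non-negative salience (weight) function over feature sets, and $\alpha, \beta \ge 0$ control how strongly each object's distinctive features are penalized:

$$
S(A, B) = \frac{f(A \cap B)}{f(A \cap B) + \alpha\, f(A \setminus B) + \beta\, f(B \setminus A)}
$$

The value lies in $[0, 1]$ and equals 1 when neither object has distinctive features. When $f$ is set cardinality, $\alpha = \beta = 1$ recovers the Jaccard index and $\alpha = \beta = 1/2$ recovers the Dice coefficient.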

Paper Structure

This paper contains 24 sections, 8 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Realizations from the stochastic image models of fluorescence microscopy. Top: the correlated-background lumpy triple (CoBaLT) model. Bottom: the Worley-noise soft tissue (WonoST) model. The RGB composite is shown in the first column, and each subsequent column shows only the red, green, or blue channel. The contrast of each channel has been enhanced for display.
  • Figure 2: Examples of whole-feature ablation in the CoBaLT stochastic context model. Top left: an unadulterated realization. Top right: same realization with the spiny structures removed. Bottom left: same realization with a scrambled background texture. Bottom right: same realization with both spiny structures removed and background texture scrambled.
  • Figure 3: Examples of feature reduction in the WonoST stochastic context model. Top: red channel. Bottom: blue channel. Perturbation of the green channel is straightforward and needs no illustration. The left column is the unperturbed channel and the right column is the largest perturbation (32%). Note the many subtle differences observable on close inspection. All images have been contrast-enhanced for display.
  • Figure 4: Examples of similar images submitted to the AAPM Grand Challenge on deep generative models [deshpande2025report]. Each image is a 256×256-pixel region taken from a larger elliptical shape. Top left: training data. Top right: Rank 1. Bottom left: Rank 2. Bottom right: Rank 3. These ranks are specific to the present work and do not correspond to the contest ranking. Differences from the training image in the bright "skeleton," the foreground shape, and the texture are obvious throughout the ranks.
  • Figure 5: Boxplot of the similarity between random pairs comprising a subject image (x-axis) and a training image. Rank 1 has nearly the same range of values as comparisons of training data to other training data (first boxes). Rank 2 has much lower similarity, on average, implying that the ensemble comprises many images with features not seen in the training data. Essentially none of the similarity-defining features are seen in Rank 3 images. The clear distinction between the ensembles is not obvious from the distance alone.
  • ...and 4 more figures