A Proper Scoring Rule for Virtual Staining

Samuel Tonks, Steve Hood, Ryan Musso, Ceridwen Hopely, Steve Titus, Minh Doan, Iain Styles, Alexander Krull

TL;DR

This work introduces information gain (IG), a cell-wise evaluation framework that directly assesses predicted posteriors. It evaluates diffusion- and GAN-based models on an extensive HTS dataset using IG and other metrics, and shows that IG reveals substantial performance differences that other metrics cannot.

Abstract

Generative virtual staining (VS) models for high-throughput screening (HTS) can provide an estimated posterior distribution of possible biological feature values for each input and cell. However, when evaluating a VS model, the true posterior is unavailable. Existing evaluation protocols only check the accuracy of the marginal distribution over the dataset rather than the predicted posteriors. We introduce information gain (IG) as a cell-wise evaluation framework that enables direct assessment of predicted posteriors. IG is a strictly proper scoring rule and comes with a sound theoretical motivation allowing for interpretability, and for comparing results across models and features. We evaluate diffusion- and GAN-based models on an extensive HTS dataset using IG and other metrics and show that IG can reveal substantial performance differences other metrics cannot.
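The strict propriety claimed for IG can be illustrated with a short derivation. Because the paper's definition is not reproduced in this section, the sketch assumes that the IG for one cell is the log-likelihood of the target $y$ under the predicted posterior $Q$ minus its log-likelihood under a fixed reference distribution $M$ (for example, the dataset marginal). The symbols $P$ (true posterior), $Q$, and $M$ are notation introduced here for illustration only.

$$
\mathrm{IG}(Q; y) = \log Q(y) - \log M(y),
\qquad
\mathbb{E}_{y \sim P}\big[\mathrm{IG}(Q; y)\big] = \mathrm{KL}(P \,\|\, M) - \mathrm{KL}(P \,\|\, Q).
$$

Under this assumption, the first KL term does not depend on the model, and $\mathrm{KL}(P \,\|\, Q) \ge 0$ with equality only when $Q = P$ (almost everywhere). The expected score is therefore maximized uniquely by the true posterior, which is what strict propriety means, even though only a single sample $y$ from $P$ is observed per cell. With base-2 logarithms the score is measured in bits.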

Paper Structure (5 sections, 5 equations, 4 figures)

Figures (4)

  • Figure 1: How to evaluate a predicted posterior distribution with a single target sample? A traditional HTS pipeline (top) uses fluorescence microscopy and extracts feature values $\mathcal{Y}_{i,j}$ for each cell $j$ in image $i$. A cell's true feature value $\mathcal{Y}_{i,j}$ can be seen as a single sample from an inaccessible posterior $P(\mathcal{Y}_{i,j}|\mathbf{x}_{i,j})$. Although a generative VS model can produce many samples (bottom) $\hat{\mathcal{Y}}_{i,j}^{1,\ldots,K}$, its learned distribution must be evaluated against the single true value.
  • Figure 2: Qualitative results for a single input image. For an unseen bright-field image, we show three model samples and the sample corresponding to the median posterior result. Alongside, we display the predicted posterior distribution for the 'upper quartile intensity' feature (F7), highlighting the median values and the target feature value. Pix2PixHD underestimates cell intensity, while cDDPM exhibits greater variability; its posterior is wider and closer to the target value.
  • Figure 3: Evaluation metrics for the upper quartile intensity (F7) feature. The marginal KLD and rank distributions are inconclusive and mask differences in model performance. The log-likelihood distribution reveals the superior performance of cDDPM.
  • Figure 4: Comparing evaluation metrics across features (F1-F18). Metrics are averaged for F10-F18 due to space limitations. The best model for each feature and metric is highlighted in bold/italic. The proposed information gain shows a clear advantage for cDDPM across features, while the other metrics mask the performance difference.
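
The setting in Figure 1, where a model yields K samples per cell but only one true feature value is available, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It assumes that the predicted posterior density is estimated from the samples with a Gaussian kernel density estimate, and that information gain is the log-likelihood advantage (in bits) of that density over a marginal baseline built from pooled feature values. The function names, the bandwidth rule, and the synthetic data are all hypothetical.

```python
import math
import random
import statistics


def kde_log_density(samples, y, bandwidth=None):
    """Log-density of y under a Gaussian KDE fitted to `samples`.

    Assumption: a Silverman-style rule of thumb sets the bandwidth when
    none is given; it requires at least two non-identical samples.
    """
    n = len(samples)
    if bandwidth is None:
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    # Log-sum-exp over the kernels keeps far-off targets finite.
    exponents = [-0.5 * ((y - s) / bandwidth) ** 2 for s in samples]
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return log_sum - math.log(n * bandwidth * math.sqrt(2 * math.pi))


def information_gain_bits(posterior_samples, marginal_samples, target):
    """Assumed IG: log2 of (posterior density / marginal density) at the target."""
    log_q = kde_log_density(posterior_samples, target)
    log_m = kde_log_density(marginal_samples, target)
    return (log_q - log_m) / math.log(2)


# Synthetic stand-ins for one cell (illustrative values, not real data).
rng = random.Random(0)
marginal = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # pooled feature values
target = 1.5                                             # the single true value
on_target = [rng.gauss(1.5, 0.2) for _ in range(200)]    # samples near the target
off_target = [rng.gauss(-1.0, 0.2) for _ in range(200)]  # confident but wrong

ig_on = information_gain_bits(on_target, marginal, target)
ig_off = information_gain_bits(off_target, marginal, target)
print(f"IG, samples near target:  {ig_on:.2f} bits")
print(f"IG, samples off target:   {ig_off:.2f} bits")
```

In this sketch a positive value means the predicted posterior explains the observed value better than the marginal baseline, and a negative value means it does worse. Averaging the per-cell values over a dataset would give a single score per model and feature, again under the assumptions stated above.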