Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality
Tushar Kataria, Shikha Dubey, Mary Bronner, Jolanta Jedrzkiewicz, Ben J. Brintz, Shireen Y. Elhabian, Beatrice S. Knudsen
TL;DR
This work tackles the gap in evaluating virtual immunohistochemistry by proposing an automated, accuracy-grounded framework that complements traditional image-fidelity metrics. It benchmarks sixteen diverse image-to-image translation models (both paired and unpaired) on tile- and whole-slide image data, employing color-deconvolution to generate real vs. virtual DAB masks and segmentation-based metrics (Dice, IoU, Hausdorff distance, TPR, TNR) to quantify staining accuracy. The study reveals that conventional fidelity metrics (FID, KID, PSNR, SSIM, MSE) poorly reflect staining accuracy and pathologist assessments, while segmentation-based metrics provide more interpretable, scalable measures of true staining. It also demonstrates that whole-slide evaluation uncovers tiling artifacts and boundary mislabeling not evident in patch-level analyses, underscoring the need for WSI-aware benchmarking. The proposed pipeline enables pathologist-free, automated evaluation and supports robust, high-throughput benchmarking to accelerate the clinical adoption of virtual staining technologies.
Abstract
Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge, as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated, accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Fréchet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC-positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks. Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.
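The evaluation idea described above (color deconvolution to obtain brown-pixel masks, then overlap metrics between real and virtual masks) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The stain vectors are the commonly used Ruifrok-Johnston hematoxylin/eosin/DAB values, and the threshold of 0.3, the function names, the toy pixel colors, and the choice to compute the Hausdorff distance over all positive pixels rather than mask boundaries are all assumptions made for illustration.

```python
import math

# Assumed stain vectors (rows: hematoxylin, eosin, DAB), Ruifrok-Johnston values.
STAINS = [
    [0.65, 0.70, 0.29],
    [0.07, 0.99, 0.11],
    [0.27, 0.57, 0.78],
]

def invert_3x3(m):
    """Invert a 3x3 matrix with the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]

UNMIX = invert_3x3(STAINS)

def dab_density(rgb):
    """Estimate the DAB concentration of one 8-bit RGB pixel."""
    # Optical density per channel (Beer-Lambert law, white = 255).
    od = [-math.log10((v + 1) / 256.0) for v in rgb]
    # Concentrations = od (row vector) @ UNMIX; DAB is the third column.
    return sum(od[k] * UNMIX[k][2] for k in range(3))

def dab_mask(image, threshold=0.3):
    """Binary mask of IHC-positive pixels; the threshold is an assumed value."""
    return [[dab_density(px) > threshold for px in row] for row in image]

def overlap_metrics(real, virtual):
    """Dice, IoU, TPR and TNR of the virtual mask against the real mask."""
    tp = fp = fn = tn = 0
    for real_row, virtual_row in zip(real, virtual):
        for r, v in zip(real_row, virtual_row):
            if r and v:
                tp += 1
            elif v:
                fp += 1
            elif r:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "tpr": ratio(tp, tp + fn),
        "tnr": ratio(tn, tn + fp),
    }

def hausdorff(real, virtual):
    """Symmetric Hausdorff distance (pixels) between positive-pixel sets."""
    a = [(y, x) for y, row in enumerate(real) for x, v in enumerate(row) if v]
    b = [(y, x) for y, row in enumerate(virtual) for x, v in enumerate(row) if v]
    if not a or not b:
        return float("inf")  # undefined when either mask is empty

    def directed(p, q):
        return max(min(math.dist(u, w) for w in q) for u in p)

    return max(directed(a, b), directed(b, a))

# Toy 2x3 "images": brown = DAB-positive, blue = hematoxylin, white = background.
BROWN, WHITE, BLUE = (136, 68, 41), (255, 255, 255), (56, 50, 130)
real_ihc = [[BROWN, BROWN, WHITE], [WHITE, BLUE, WHITE]]
virtual_ihc = [[BROWN, WHITE, WHITE], [WHITE, BLUE, BROWN]]

real_mask = dab_mask(real_ihc)
virtual_mask = dab_mask(virtual_ihc)
scores = overlap_metrics(real_mask, virtual_mask)
scores["hausdorff"] = hausdorff(real_mask, virtual_mask)
print(scores)
```

In this toy case the virtual image misses one brown pixel and adds a spurious one elsewhere, so the overlap scores fall even though the two images contain the same number of brown pixels. That kind of pixel-level mislabeling is what the abstract argues distribution-based fidelity metrics do not capture. On real tiles or whole-slide images, an array library would replace the nested loops, and the brute-force Hausdorff computation would need a more efficient alternative.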
