Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality

Tushar Kataria, Shikha Dubey, Mary Bronner, Jolanta Jedrzkiewicz, Ben J. Brintz, Shireen Y. Elhabian, Beatrice S. Knudsen

TL;DR

This work tackles the gap in evaluating virtual immunohistochemistry by proposing an automated, accuracy-grounded framework that complements traditional image-fidelity metrics. It benchmarks sixteen diverse image-to-image translation models (both paired and unpaired) on tile and whole-slide image data, using color deconvolution to generate DAB masks for real and virtual IHC and segmentation-based metrics (Dice, IoU, Hausdorff distance, TPR, TNR) to quantify staining accuracy. The study reveals that conventional fidelity metrics (FID, KID, PSNR, SSIM, MSE) poorly reflect staining accuracy and pathologist assessments, while segmentation-based metrics provide more interpretable, scalable measures of true staining. It also demonstrates that whole-slide evaluation uncovers tiling artifacts and boundary mislabeling not evident in patch-level analyses, underscoring the need for WSI-aware benchmarking. The proposed pipeline enables pathologist-free, automated evaluation and supports robust, high-throughput benchmarking to accelerate the clinical adoption of virtual staining technologies.
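The color-deconvolution step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stain vectors are the commonly used Ruifrok-Johnston reference values rather than values from the paper, and the function names (`dab_amount`, `dab_mask`) and the threshold of 0.3 are hypothetical choices made for this example.

```python
import math

# Optical-density stain vectors for hematoxylin, eosin and DAB.
# ASSUMPTION: widely used Ruifrok-Johnston reference values, not taken
# from the paper; a real pipeline would calibrate them per scanner/stain.
HEMATOXYLIN = (0.65, 0.70, 0.29)
EOSIN = (0.07, 0.99, 0.11)
DAB = (0.27, 0.57, 0.78)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def dab_amount(rgb):
    """Estimate the DAB contribution of one 8-bit RGB pixel."""
    # Beer-Lambert: optical density is -log10 of the transmitted fraction.
    od = tuple(-math.log10(max(c, 1) / 255.0) for c in rgb)
    # Solve od = h*HEMATOXYLIN + e*EOSIN + d*DAB for d (Cramer's rule).
    normal = cross(HEMATOXYLIN, EOSIN)
    return dot(od, normal) / dot(DAB, normal)


def dab_mask(image, threshold=0.3):
    """Binary mask (1 = DAB-positive) for an image given as rows of RGB tuples.

    ASSUMPTION: the threshold is a hypothetical value for illustration.
    """
    return [[1 if dab_amount(px) > threshold else 0 for px in row]
            for row in image]


# Toy 2x2 tile: white background, a blue-purple nucleus, two brown pixels.
tile = [[(255, 255, 255), (57, 51, 131)],
        [(137, 69, 42), (137, 69, 42)]]
print(dab_mask(tile))
```

Applying the same function to the real and the virtual IHC tile would give the two masks that the segmentation metrics compare. In practice the threshold and stain vectors would need tuning per dataset.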

Abstract

Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge because current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated and accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Fréchet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC-positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks. Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.
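To make the stain accuracy metrics concrete, the sketch below computes Dice, IoU, TPR, TNR and a Hausdorff distance from two registered binary masks. It is an illustration under stated assumptions, not the paper's code: masks are plain nested lists of 0/1, the function names are hypothetical, and the Hausdorff distance is taken by brute force over all positive pixels (in pixel units) rather than over mask boundaries, which the paper may define differently.

```python
import math


def confusion_counts(real, virtual):
    """Count TP, FP, FN, TN pixels, treating the real IHC mask as reference."""
    tp = fp = fn = tn = 0
    for row_r, row_v in zip(real, virtual):
        for r, v in zip(row_r, row_v):
            if r and v:
                tp += 1
            elif v:
                fp += 1
            elif r:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def stain_accuracy(real, virtual):
    """Overlap metrics between a real and a virtual DAB mask."""
    tp, fp, fn, tn = confusion_counts(real, virtual)

    def ratio(num, den):
        # Undefined ratios (e.g. no positive pixels at all) become NaN.
        return num / den if den else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "tpr": ratio(tp, tp + fn),
        "tnr": ratio(tn, tn + fp),
    }


def hausdorff(real, virtual):
    """Symmetric Hausdorff distance between the positive pixels of two masks.

    ASSUMPTION: brute force over all positive pixels; fine for small tiles,
    too slow for whole-slide images.
    """
    def points(mask):
        return [(i, j) for i, row in enumerate(mask)
                for j, v in enumerate(row) if v]

    def directed(a, b):
        return max(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b)
                   for p in a)

    pts_r, pts_v = points(real), points(virtual)
    if not pts_r or not pts_v:
        return float("nan")
    return max(directed(pts_r, pts_v), directed(pts_v, pts_r))


# Toy example: the virtual stain is shifted one pixel to the right.
real = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
virtual = [[0, 1, 1, 0],
           [0, 1, 1, 0],
           [0, 0, 0, 0]]
print(stain_accuracy(real, virtual))
print(hausdorff(real, virtual))
```

In this toy case half of the stained pixels overlap, so Dice is 0.5 and IoU is 1/3, while the Hausdorff distance of 1 pixel reflects the one-pixel shift. A tile like this could still score well on PSNR or SSIM, which is the kind of mismatch the abstract describes.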

Paper Structure

This paper contains 27 sections, 9 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: Workflow to generate virtual IHC images and evaluate their quality. A. Paired H&E and IHC tiles extracted from the exact same tissue stained with H&E and restained with IHC are used to train Pix2Pix family models to generate virtual IHC images. B. Unpaired H&E and IHC tiles from different tissues stained with H&E and IHC are used to train cycle-GAN family or diffusion models. C. Evaluation of image quality utilizes standard image fidelity metrics, including manual, distribution-based and texture-based metrics. D. Stain accuracy metrics consist of segmentation metrics to determine if the correct pixels are colored in computer-generated stains. Stain accuracy is determined on both image tiles and whole slide images (WSI). FID - Fréchet Inception Distance, KID - Kernel Inception Distance, PSNR - Peak Signal-to-Noise Ratio, SSIM - Structural Similarity Index, MSE - Mean Square Error, DICE - DICE Similarity Coefficient, IoU - Intersection over Union.
  • Figure 2: Generation of virtual IHC images. A. Pix2Pix models predict which pixels in the H&E tile should be colored. During training, the discriminator decides whether the IHC image is real or virtual (fake). When the discriminator can no longer distinguish between real and fake IHC, training is complete. B. The cycle-GAN architecture uses unpaired image tiles. It includes two discriminator modules, one for real versus virtual IHC images and the other for real versus virtual H&E images. The consistency loss allows the model to learn from unpaired data. C. The diffusion model uses a GAN architecture to generate virtual IHC images. The Unpaired Neural Schrödinger Bridge (UNSB) model captures continuous, interpretable transitions between H&E and IHC domains. It scales to high-resolution biomedical images and supports incorporation of biological priors and regularization. D. Timeline of models for H&E to IHC image translation. E. Representative examples of generated IHC image tiles. The red arrow points to an area of understaining (false negative pixels) and the red box to an area of overstaining (false positive pixels).
  • Figure 3: Conventional metrics for evaluation of image quality. A. Metric categories: feature distribution metrics evaluate features that are generated by encoders of real and virtual images. Texture metrics evaluate pixel-wise differences between paired real and virtual images. B. Hematoxylin and DAB feature coverage in real and virtual images. The hematoxylin and DAB channels of tiles are unmixed and passed through the same encoder. The area of solid color depicts the feature densities of virtual images while the dashed lines show the feature densities of real images. The image tiles on the side are added for qualitative comparisons of real and virtual images. C. Manual evaluation of image tiles generated by five models. The percentage of image tiles with good cell morphology, good tissue architecture, no blurring, good color fidelity and no hallucinations is shown. D. Comparison of FID scores and average PSNR scores. Models using unpaired input data are shown by triangles and models using paired inputs by circles. E. Comparison of FID scores and manual quality metrics.
  • Figure 4: Metrics for evaluation of staining accuracy. A. Workflow to determine staining accuracy. After digitization of the H&E-stained slides, the tissue is restained with the CDX2 antibody and DAB as the chromogen. Alternatively, the digital H&E tiles are used to generate virtual CDX2-IHC tiles. Real and virtual IHC tiles are registered at pixel-level accuracy. The brown IHC stain in real and virtual IHC image tiles is converted to a binary DAB pixel mask using a trained model. After registration, the DAB mask in the virtual tile is compared to the DAB mask in the real tile using the stain accuracy metrics of IoU, DICE and Hausdorff distance (HD). True positive (TPR) and true negative rates (TNR) are calculated in addition to false positive (FP) and false negative pixel rates. B. Tile-wise correlations between texture metrics (PSNR, SSIM, MSE) and stain accuracy metrics (IoU, DICE, HD, TPR, TNR) in Pix2Pix and CycleGAN models. C. Comparison of average DICE with FID scores. Models using unpaired input data are shown by triangles and models using paired input data by circles. D. Comparison of average DICE and average PSNR scores. Circles indicate the average scores across all the tiles in the dataset. Black lines show the relationship of patch-wise DICE and PSNR scores within each model. Note the negative regression slopes of tile-wise DICE and PSNR scores within each model in contrast to the strong positive correlation of average DICE and average PSNR between models. PSNR - Peak Signal-to-Noise Ratio, SSIM - Structural Similarity Index, MSE - Mean Square Error, DICE - DICE Similarity Coefficient, IoU - Intersection over Union, HD - Hausdorff Distance.
  • Figure 5: Stain accuracy evaluation in WSIs. A - C. Accuracy of gland segmentation in virtual IHC WSI. Comparison of automated gland masks in real and virtual IHC WSI to manual gland annotations. A trained algorithm is used to generate gland outlines from the IHC pixel masks. B. Gland outlines based on virtual IHC masks from the models listed in the first column are compared to manual IHC outlines using DICE, IoU, HD, TPR and TNR metrics. For comparison, the results of the real IHC gland outlines are shown in the bottom row. C. Qualitative evaluation of gland segmentation. In the left tile, the glands are outlined by a pathologist. Note the difference in false positive annotations in the lamina propria outside the glands in the real/virtual IHC images. D - F. Performance of models trained on H&E gland segmentations. The DAB pixel masks in virtual or real IHC images are transferred to the corresponding H&E image. Annotations in the H&E image are used to train gland segmentation models. E. The performance of the gland segmentation models in a held-out test set is compared to a model trained on transferred real IHC gland outlines. Metrics as in B. F. Qualitative segmentation results of gland outlines generated by models trained directly on manual H&E gland outlines, gland outlines transferred from real IHC images and gland outlines transferred from virtual IHC images. The models used to generate the virtual IHC images are listed above the image.
  • ...and 13 more figures