iSight: Towards expert-AI co-assessment for improved immunohistochemistry staining interpretation

Jacob S. Leiby; Jialu Yao; Pan Lu; George Hu; Anna Davidian; Shunsuke Koga; Olivia Leung; Pravin Patel; Isabella Tondi Resta; Rebecca Rojansky; Derek Sung; Eric Yang; Paul J. Zhang; Emma Lundberg; Dokyoon Kim; Serena Yeung-Levy; James Zou; Thomas Montine; Jeffrey Nirschl; Zhi Huang

TL;DR

iSight is a multi-task learning framework for automated IHC staining assessment that combines visual features from whole-slide images with tissue metadata through a token-level attention mechanism. It outperforms fine-tuned foundation models (PLIP, CONCH) and produces well-calibrated predictions, with expected calibration errors of 0.0150 to 0.0408.

Abstract

Immunohistochemistry (IHC) provides information on protein expression in tissue sections and is commonly used to support pathology diagnosis and disease triage. While AI models for H\&E-stained slides show promise, their applicability to IHC is limited due to domain-specific variations. Here we introduce HPA10M, a dataset of 10,495,672 IHC images from the Human Protein Atlas with comprehensive metadata, spanning 45 normal tissue types and 20 major cancer types. Based on HPA10M, we trained iSight, a multi-task learning framework for automated IHC staining assessment. iSight combines visual features from whole-slide images with tissue metadata through a token-level attention mechanism, simultaneously predicting staining intensity, location, quantity, tissue type, and malignancy status. On held-out data, iSight achieved 85.5\% accuracy for location, 76.6\% for intensity, and 75.7\% for quantity, outperforming fine-tuned foundation models (PLIP, CONCH) by 2.5--10.2\%. In addition, iSight demonstrates well-calibrated predictions, with expected calibration errors of 0.0150--0.0408. Furthermore, in a user study with eight pathologists evaluating 200 images from two datasets, iSight outperformed initial pathologist assessments on the held-out HPA dataset (79\% vs 68\% for location, 70\% vs 57\% for intensity, 68\% vs 52\% for quantity). Inter-pathologist agreement also improved after AI assistance in both the held-out HPA dataset (Cohen's $\kappa$ increased from 0.63 to 0.70) and the Stanford TMAD dataset (from 0.74 to 0.76), suggesting that expert--AI co-assessment can improve IHC interpretation. This work establishes a foundation for AI systems that can improve IHC diagnostic accuracy and highlights the potential for integrating iSight into clinical workflows to enhance the consistency and reliability of IHC assessment.
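
The abstract reports two kinds of summary statistics: expected calibration error (ECE) for the model and Cohen's kappa for inter-pathologist agreement. The sketch below shows one common way to compute each, in plain Python. It is an illustration only: the bin count (10 equal-width bins), the averaging of kappa over all rater pairs, and every function name are assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_bin / N) * |accuracy - mean confidence|.

    The binning scheme is an assumption; the paper's exact procedure is not stated here.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings):
    """Average kappa over all rater pairs (hypothetical aggregation for a multi-rater study)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy inputs, not data from the study.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
    rater_1 = ["strong", "strong", "weak", "weak"]
    rater_2 = ["strong", "strong", "weak", "strong"]
    print(cohen_kappa(rater_1, rater_2))
```

A lower ECE means predicted confidence tracks observed accuracy more closely, while a higher kappa means raters agree more often than chance alone would predict.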

Paper Structure (29 sections, 7 figures)

Introduction
Results
Discussion
Methods
Acknowledgements
Ethics declarations
Author contributions statement
Data Availability
Code Availability

Figures (7)

Figure 1: HPA10M dataset and iSight model architecture for multi-task immunohistochemistry analysis. a, Dataset construction workflow from the Human Protein Atlas. b, Distribution of 45 normal tissue types in HPA. c, Distribution of 20 major cancer types in HPA10M. d, The model processes whole-slide images by dividing them into 336×336 patches, extracting visual features with a Vision Transformer (CLIP-ViT-large-patch-14-336), and aggregating patch-level representations using gated attention-based multiple instance learning (MIL). Text metadata, including tissue type, SNOMED diagnosis, and antibody information, is encoded separately using the CLIP text encoder. e, The multi-task learning framework uses five parallel classification heads to simultaneously predict staining properties (location, intensity, quantity) and tissue characteristics (tissue type, malignancy status). Image credit: Human Protein Atlas (http://v23.proteinatlas.org/ENSG00000170312-CDK1/).
Figure 2: Model comparison and performance analysis on held-out HPA test set. a, Performance comparison between iSight, fine-tuned PLIP, and fine-tuned CONCH on three primary staining tasks: location, intensity, and quantity. b, Performance comparison on two auxiliary tasks: tissue type classification and malignancy detection. c-e, Confusion matrices for iSight showing classification patterns for staining intensity, location, and quantity, respectively. f-h, Calibration curves for iSight showing the relationship between predicted confidence and actual accuracy for staining location, intensity, and quantity tasks. Each curve displays Expected Calibration Error (ECE) values with 95% confidence intervals. i, Within-one-rank accuracy and beyond-one-rank error rates for ordinal classification tasks (staining intensity and quantity) comparing iSight with baseline models PLIP and CONCH. Error bars represent 95% confidence intervals derived from bootstrap resampling. Paired two-tailed Student’s t-tests were used to evaluate statistical significance (*p < 0.05, **p < 0.01, ***p < 0.001).
Figure 3: Pathologist user study across HPA and Stanford datasets. a, iSight pathologist user study workflow. b, Web-based annotation interface for pathologist user study. c, Comparison of classification performance against ground truth in the HPA dataset for AI predictions, pathologist initial annotations, and pathologist annotations after AI suggestion across three staining tasks. d-e, Inter-pathologist agreement measured by average Cohen's $\kappa$ before and after AI suggestions for each feature in HPA and Stanford datasets. Error bars represent 95% confidence intervals derived from bootstrap resampling. Paired two-tailed Student’s t-tests were used to evaluate statistical significance (*p < 0.05, **p < 0.01, ***p < 0.001).
Figure 4: AI influence analysis across HPA and Stanford datasets. a-c, AI influence analysis in the HPA dataset, showing the distribution of pathologist responses when their initial annotation disagreed with the AI prediction for location, intensity, and quantity tasks, respectively: adopted AI suggestion (dark blue), changed to alternative label (medium blue), or maintained original annotation (light blue). d-f, AI influence analysis in the Stanford dataset for location, intensity, and quantity tasks.
Figure S1: Monthly IHC Case Volume at Stanford Healthcare from 2005 to 2025. The gray line represents the raw monthly IHC case counts, while the blue line shows the LOESS smoothed trend. The data demonstrates a consistent upward trajectory in IHC utilization over the 20-year period, with notable fluctuations during the COVID-19 pandemic (marked in red text) starting in April 2020. The analysis includes 245 months of data, showing an increase from approximately 700 cases per month in 2005 to over 3,500 cases per month by 2025.
...and 2 more figures
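
The Figure 1d caption describes aggregating patch-level features with gated attention-based MIL. The sketch below illustrates a common gated-attention pooling formulation in plain Python: each patch gets a score from a tanh branch multiplied elementwise by a sigmoid gate, the scores are softmax-normalized across patches, and the slide representation is the weighted sum of patch features. The parameter names `V`, `U`, and `w`, the toy dimensions, and the function names are all assumptions for illustration; the paper's actual layer sizes and implementation are not given here.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gated_attention_pool(patches, V, U, w):
    """Pool K patch feature vectors into one slide-level vector.

    patches: K vectors of dimension D
    V, U:    L x D weight matrices (tanh branch and sigmoid gate); hypothetical names
    w:       length-L vector mapping the gated hidden state to a scalar score
    Returns (slide_vector, attention_weights).
    """
    scores = []
    for h in patches:
        tanh_branch = [math.tanh(v) for v in matvec(V, h)]
        gate_branch = [1.0 / (1.0 + math.exp(-u)) for u in matvec(U, h)]
        scores.append(sum(wi * t * g for wi, t, g in zip(w, tanh_branch, gate_branch)))

    # Softmax over patches, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted sum of patch features.
    dim = len(patches[0])
    slide = [sum(a * h[d] for a, h in zip(weights, patches)) for d in range(dim)]
    return slide, weights


if __name__ == "__main__":
    # Toy example: three 2-dimensional patch features and a 1-dimensional hidden layer.
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    V = [[0.5, -0.2]]
    U = [[0.1, 0.3]]
    w = [1.0]
    slide, weights = gated_attention_pool(patches, V, U, w)
    print(slide, weights)
```

In a trained model `V`, `U`, and `w` would be learned, so the attention weights indicate which patches contribute most to the slide-level prediction.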
