Domain Adaptation using Silver Standard Labels for Ki-67 Scoring in Digital Pathology: A Step Closer to Widescale Deployment
Amanda Dy, Ngoc-Nhu Jennifer Nguyen, Seyed Hossein Mirjahanmardi, Melanie Dawe, Anthony Fyles, Wei Shi, Fei-Fei Liu, Dimitrios Androutsos, Susan Done, April Khademi
TL;DR
This work tackles domain shift in automated Ki-67 proliferation index (PI) scoring for digital pathology by introducing an unsupervised domain adaptation pipeline that generates silver-standard (SS, pseudo) labels in the target domain. The method pre-trains two Ki-67 quantification architectures, UV-Net and piNET, on target-domain SS data and then fine-tunes them on source-domain gold-standard (GS) data; this SS+GS configuration achieves the best target-domain performance and consistency. Across source and target domains, the SS+GS configuration yields higher PI accuracy (mean around $95.9\%$ for piNET) and substantially lower $\Delta PI$ errors (reduced from about $7.5\%$ to near $4\%$), with t-SNE evidence of a reduced domain gap. Used as a per-site calibration step, the approach brings automated Ki-67 scoring tools closer to widescale deployment by avoiding manual labeling at every new site while maintaining robust accuracy.
Abstract
Deep learning systems have been proposed to improve the objectivity and efficiency of Ki-67 proliferation index (PI) scoring. Although these systems can be very accurate, their performance drops when they are applied to out-of-domain data. This is a critical obstacle to clinical translation, because models are typically trained on data available to the vendor, which does not come from the target domain. To address this challenge, this study proposes a domain adaptation pipeline that employs an unsupervised framework to generate silver standard (SS, pseudo) labels in the target domain, which are used to augment the gold standard (GS) source domain data. Five training regimes were tested on two validated Ki-67 scoring architectures (UV-Net and piNET): (1) SS Only: trained on target SS labels; (2) GS Only: trained on source GS labels; (3) Mixed: trained on target SS and source GS labels; (4) GS+SS: trained on source GS labels and fine-tuned on target SS labels; and our proposed method, (5) SS+GS: trained on target SS labels and fine-tuned on source GS labels. The SS+GS method yielded significantly (p < 0.05) higher PI accuracy (95.9%) and more consistent results than the GS Only model on target data. Analysis of t-SNE plots showed that the features learned by the SS+GS models are more closely aligned between source and target data, resulting in improved generalization. The proposed pipeline provides an efficient method for learning the target distribution without manual annotations, which are time-consuming and costly to generate for medical images. This framework can be applied to any target site as a per-laboratory calibration method, for widescale deployment.
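The five training regimes can be written down as ordered training stages, which makes the difference between GS+SS and SS+GS explicit. The following is a minimal sketch, not the authors' implementation: the names `Regime`, `REGIMES`, `TARGET_SS`, `SOURCE_GS`, `proliferation_index`, and `pi_difference` are hypothetical, and the SS+GS stage order follows the TL;DR's description (pre-train on target-domain SS data, then fine-tune on source-domain GS data). The PI formula is the conventional one (Ki-67 positive tumour nuclei over all counted tumour nuclei), and treating $\Delta PI$ as an absolute difference between predicted and reference PI is an assumption, since neither definition is spelled out in this section.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical dataset identifiers, for illustration only.
TARGET_SS = "target_SS"  # silver standard (pseudo) labels from the target site
SOURCE_GS = "source_GS"  # gold standard (manual) labels from the source site


@dataclass(frozen=True)
class Regime:
    name: str
    # Each stage lists the datasets used together in one training phase.
    # A second stage, when present, is a fine-tuning phase.
    stages: Tuple[Tuple[str, ...], ...]


REGIMES: List[Regime] = [
    Regime("SS Only", ((TARGET_SS,),)),
    Regime("GS Only", ((SOURCE_GS,),)),
    Regime("Mixed", ((TARGET_SS, SOURCE_GS),)),
    Regime("GS+SS", ((SOURCE_GS,), (TARGET_SS,))),
    Regime("SS+GS", ((TARGET_SS,), (SOURCE_GS,))),  # proposed method
]


def proliferation_index(positive: int, negative: int) -> float:
    """PI in percent: Ki-67 positive tumour nuclei over all counted tumour nuclei."""
    total = positive + negative
    if total == 0:
        raise ValueError("no tumour nuclei counted")
    return 100.0 * positive / total


def pi_difference(predicted_pi: float, reference_pi: float) -> float:
    """Assumed definition of Delta PI: absolute difference in percentage points."""
    return abs(predicted_pi - reference_pi)


if __name__ == "__main__":
    for regime in REGIMES:
        print(regime.name, "->", " then ".join("+".join(s) for s in regime.stages))
```

In this representation, GS+SS and SS+GS use the same two datasets and differ only in stage order, which is the distinction the comparison of regimes (4) and (5) rests on.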
