Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment

Junlin Guo, Siqi Lu, Can Cui, Ruining Deng, Tianyuan Yao, Zhewen Tao, Yizhe Lin, Marilyn Lionts, Quan Liu, Juming Xiong, Yu Wang, Shilin Zhao, Catie Chang, Mitchell Wilkes, Mengmeng Yin, Haichun Yang, Yuankai Huo

TL;DR

This study assesses the readiness of three cell foundation models (Cellpose, StarDist, CellViT) for kidney nuclei segmentation using a large, diverse dataset of 2,542 kidney WSIs. It introduces a human-in-the-loop (HITL) data enrichment framework that fuses predictions from multiple models with limited expert corrections to generate enriched training data, enabling continual fine-tuning that improves performance across all models. Notably, StarDist achieves an $F1$ score of $0.8229$ after fine-tuning, while CellViT remains strong with $F1$ up to $0.7952$ when using combined easy and hard labels, illustrating that organ-targeted fine-tuning and HITL enrichment can substantially boost nuclei segmentation in histology. The findings highlight persistent gaps between general foundation models and organ-specific tasks, and propose a practical, low-labeling-cost path toward robust, real-world kidney pathology workflows through multi-model data curation and targeted fine-tuning.

Abstract

Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges, including digital pathology. While many of these models have been developed for tasks like disease diagnosis and tissue quantification using extensive and diverse training datasets, their readiness for deployment on some of the arguably simplest tasks, such as nuclei segmentation within a single organ (e.g., the kidney), remains uncertain. This paper seeks to answer this key question, "How good are we?", by thoroughly evaluating the performance of recent cell foundation models on a curated multi-center, multi-disease, and multi-species external testing dataset. Additionally, we tackle a more challenging question, "How can we improve?", by developing and assessing human-in-the-loop data enrichment strategies aimed at enhancing model performance while minimizing the reliance on pixel-level human annotation. To address the first question, we curated a multi-center, multi-disease, and multi-species dataset consisting of 2,542 kidney whole slide images (WSIs). Three state-of-the-art (SOTA) cell foundation models (Cellpose, StarDist, and CellViT) were selected for evaluation. To tackle the second question, we explored data enrichment algorithms that distill predictions from the different foundation models within a human-in-the-loop framework, aiming to further enhance foundation model performance with minimal human effort. Our experimental results showed that all three foundation models improved over their baselines after fine-tuning with enriched data. Interestingly, the baseline model with the highest F1 score does not yield the best segmentation outcomes after fine-tuning. This study establishes a benchmark for the development and deployment of cell vision foundation models tailored for real-world data applications.

Paper Structure

This paper contains 24 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overall framework. The upper panel illustrates the diverse evaluation dataset consisting of 2,542 kidney WSIs. Performance: Kidney cell nuclei instance segmentation was performed using three SOTA cell foundation models: Cellpose, StarDist, and CellViT. Model performance was evaluated based on qualitative human feedback for each prediction mask. Data Enrichment: A human-in-the-loop (HITL) design integrates prediction masks from performance evaluation into the model’s continual learning process, reducing reliance on pixel-level human annotation.
  • Figure 2: Human-in-the-loop (HITL) Data Enrichment Design. The upper panel shows the inference and curation of prediction masks from the three foundation models (Cellpose, StarDist, and CellViT) in this study. (a) First, kidney nuclei instance segmentation was performed on the evaluation dataset using the three cell foundation models. (b) Model performance was evaluated by rating each prediction mask as "good," "medium," or "bad" according to criteria from a renal pathologist. "Good" predictions captured approximately 90% of the nuclei in a patch, "bad" predictions captured less than 50%, and the rest were classified as "medium." We used this rating system to both qualitatively and quantitatively evaluate and categorize each model’s predictions within our dataset. (c-e) The lower panel illustrates the data enrichment strategy that uses these curation outcomes to enhance model performance through continual fine-tuning. Specifically, to enrich the training dataset while minimizing pixel-level annotation, we used both pseudo-labeled images from multiple foundation models (termed "easy") and "hard" samples on which all models failed.
  • Figure 3: Illustrations of Rating Criteria. This figure illustrates the rating criteria, showing examples and their ratings for the "good", "medium", and "bad" categories.
  • Figure 4: Distribution of Rated Predictions from Cell Foundation Models Across the Evaluation Dataset. Each row represents one foundation model's predictions, with three values corresponding to the number of predictions rated as "good," "medium," and "bad," respectively. Data enrichment (shown as the "Fused" model) was then performed based on the evaluation results of the individual models, resulting in an increase in "good" image patches and a decrease in "bad" image patches. Lastly, we summarized a taxonomy of "bad" image patches on which all foundation models failed.
  • Figure 5: Cross-Model Performance Agreement. (a) shows the agreement matrix between each pair of foundation models. To further assess cross-model performance, (b) shows the percentages of image patches where all three models agree, two models agree, or no models agree, for each prediction class ("good", "medium", "bad").
  • ...and 3 more figures
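
The rating rule and the easy/hard split described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. In the study the ratings come from human review following a renal pathologist's criteria, so the numeric `captured_fraction` input is a hypothetical stand-in for that judgment. The exact selection rule is also an assumption: here a patch counts as "easy" when at least one model is rated "good", and as "hard" when all three models are rated "bad". The names `rate_prediction`, `partition_patches`, and `MODELS` are invented for this example.

```python
from typing import Dict, List

# Hypothetical model keys; the study evaluates Cellpose, StarDist, and CellViT.
MODELS = ("cellpose", "stardist", "cellvit")


def rate_prediction(captured_fraction: float) -> str:
    """Map the fraction of nuclei captured in a patch to a rating.

    Thresholds follow the caption: about 90% or more is "good",
    less than 50% is "bad", everything in between is "medium".
    """
    if not 0.0 <= captured_fraction <= 1.0:
        raise ValueError("captured_fraction must be in [0, 1]")
    if captured_fraction >= 0.9:
        return "good"
    if captured_fraction < 0.5:
        return "bad"
    return "medium"


def partition_patches(ratings: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Split patches into easy / hard / other from per-model ratings.

    Assumed rule (not confirmed by the source):
      easy  -> at least one model produced a "good" mask (usable pseudo-label)
      hard  -> every model produced a "bad" mask (candidate for expert correction)
      other -> anything else
    """
    result: Dict[str, List[str]] = {"easy": [], "hard": [], "other": []}
    for patch_id, per_model in ratings.items():
        labels = [per_model[m] for m in MODELS]
        if "good" in labels:
            result["easy"].append(patch_id)
        elif all(label == "bad" for label in labels):
            result["hard"].append(patch_id)
        else:
            result["other"].append(patch_id)
    return result


if __name__ == "__main__":
    example = {
        "patch_a": {"cellpose": "good", "stardist": "medium", "cellvit": "bad"},
        "patch_b": {"cellpose": "bad", "stardist": "bad", "cellvit": "bad"},
        "patch_c": {"cellpose": "medium", "stardist": "bad", "cellvit": "medium"},
    }
    print(partition_patches(example))
```

Under these assumptions, only the "hard" bucket would need pixel-level expert correction, which is how the framework keeps manual annotation small.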