Evaluating New AI Cell Foundation Models on Challenging Kidney Pathology Cases Unaddressed by Previous Foundation Models

Runchen Wang, Junlin Guo, Siqi Lu, Ruining Deng, Zhengyi Lu, Yanfan Zhu, Yuechen Yang, Chongyu Qu, Yu Wang, Shilin Zhao, Catie Chang, Mitchell Wilkes, Mengmeng Yin, Haichun Yang, Yuankai Huo

TL;DR

This work benchmarks 2025-era AI cell foundation models (three CellViT++ variants and Cellpose-SAM) for nuclei instance segmentation in kidney pathology, using a human-in-the-loop rating on 2,091 challenging patches and a fusion ensemble to leverage model complementarities. It demonstrates that a fusion of these state-of-the-art models yields substantial gains over prior work, increasing Good segmentations on hard cases to 62.2% and nearly eliminating Bad outcomes, while also improving performance on the full dataset. Cross-model agreement analyses reveal meaningful diversity among the four models, supporting the ensemble approach. The study provides a curated hard-case dataset and quantitative evidence that ensemble strategies can robustly enhance renal histopathology analysis, informing future model refinement and deployment.

Abstract

Accurate cell nuclei segmentation is critical for downstream tasks in kidney pathology and remains a major challenge due to the morphological diversity and imaging variability of renal tissues. While our prior work has evaluated early-generation AI cell foundation models in this domain, the effectiveness of recent cell foundation models remains unclear. In this study, we benchmark advanced AI cell foundation models (2025), including CellViT++ variants and Cellpose-SAM, against three widely used cell foundation models developed prior to 2024, using a diverse large-scale set of kidney image patches within a human-in-the-loop rating framework. We further performed fusion-based ensemble evaluation and model agreement analysis to assess the segmentation capabilities of the different models. Our results show that CellViT++ [Virchow] yields the highest standalone performance with 40.3% of predictions rated as "Good" on a curated set of 2,091 challenging samples, outperforming all prior models. In addition, our fused model achieves 62.2% "Good" predictions and only 0.4% "Bad", substantially reducing segmentation errors. Notably, the fusion model (2025) successfully resolved the majority of challenging cases that remained unaddressed in our previous study. These findings demonstrate the potential of AI cell foundation model development in renal pathology and provide a curated dataset of challenging samples to support future kidney-specific model refinement.
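The abstract does not spell out how the fused model is built from the per-model human ratings. One plausible reading, shown here only as a minimal sketch, is that each patch keeps the best rating any individual model received. That fusion rule, the function names `fuse_ratings` and `rating_share`, and the toy ratings are all assumptions for illustration, not the authors' implementation or data.

```python
from collections import Counter

# Ordinal ranking of the three rating categories used in the study.
RANK = {"Bad": 0, "Medium": 1, "Good": 2}


def fuse_ratings(ratings_by_model):
    """Per patch, keep the best rating any model received (assumed fusion rule)."""
    per_patch = zip(*ratings_by_model.values())  # one tuple of ratings per patch
    return [max(ratings, key=RANK.__getitem__) for ratings in per_patch]


def rating_share(ratings):
    """Fraction of patches falling in each rating category."""
    counts = Counter(ratings)
    return {label: counts.get(label, 0) / len(ratings) for label in RANK}


# Toy ratings for four hypothetical patches and two hypothetical models.
toy = {
    "model_a": ["Good", "Bad", "Medium", "Bad"],
    "model_b": ["Medium", "Good", "Medium", "Bad"],
}
fused = fuse_ratings(toy)
shares = rating_share(fused)
```

Under this rule the fused "Good" share can never be lower than that of the best single model, and a patch stays "Bad" only if every model was rated "Bad" on it, which is consistent in spirit with the reported rise in "Good" and drop in "Bad" predictions.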

Paper Structure

This paper contains 16 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overall framework. The previous study (Aug 2024) performed kidney cell nuclei instance segmentation using three cell foundation models: Cellpose, StarDist, and CellViT. It revealed segmentation limitations in hard kidney pathology cases. These difficult cases were re-evaluated and categorized in Aug 2025 using four state-of-the-art models: three CellViT++ variants and Cellpose-SAM. Model performance was evaluated based on human-in-the-loop rating for each prediction mask.
  • Figure 2: Illustration of new AI cell foundation models in this evaluation. Compared to our earlier work (before August 2024, shown in blue), the new models adopt vision transformer architectures for both encoder and decoder. These encoders (shown in distinct colors) incorporate large-scale pretrained foundation models such as HIPT, Virchow, and SAM.
  • Figure 3: Rating criteria for this assessment, with illustrative examples of each rating category.
  • Figure 4: Distribution of rated predictions from AI cell foundation models across the evaluation dataset. Each row shows one foundation model’s predictions, with three values giving the number of predictions rated “Good,” “Medium,” and “Bad,” respectively. Data enrichment (shown as the “Fused” model) was then performed based on the evaluation results of the individual models, increasing the number of “Good” image patches and decreasing the number of “Bad” image patches.
  • Figure 5: Cross-model performance rating agreement. (a) shows the agreement percentages between each pair of foundation models used in this study. To further assess the cross-model performance, (b) shows the percentages of image patches where all four models agree, three agree, two agree, or no agreement, for each rating category (“Good”, “Medium”, “Bad”).
  • ...and 2 more figures
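
The agreement statistics described in the Figure 5 caption can be sketched as follows. This is an illustrative reading rather than the paper's code: pairwise agreement is taken to be the fraction of patches on which two models received the same rating, and per-category agreement is taken to be the number of models assigned a given rating on each patch. The helper names `pairwise_agreement` and `votes_for_label` and the toy ratings are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(ratings_by_model):
    """Fraction of patches on which each pair of models received the same rating."""
    result = {}
    for a, b in combinations(ratings_by_model, 2):
        ra, rb = ratings_by_model[a], ratings_by_model[b]
        result[(a, b)] = sum(x == y for x, y in zip(ra, rb)) / len(ra)
    return result


def votes_for_label(ratings_by_model, label):
    """For each patch, count how many models were assigned the given rating."""
    per_patch = zip(*ratings_by_model.values())
    return [sum(r == label for r in ratings) for ratings in per_patch]


# Toy ratings for three hypothetical patches and three hypothetical models.
toy = {
    "m1": ["Good", "Bad", "Medium"],
    "m2": ["Good", "Good", "Medium"],
    "m3": ["Good", "Bad", "Bad"],
}
pairs = pairwise_agreement(toy)
good_votes = votes_for_label(toy, "Good")
```

Tallying the per-patch vote counts for one category (for example with `collections.Counter`) gives the kind of "all agree / three agree / two agree" breakdown the caption describes, under the stated assumptions.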