How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring

Yanfan Zhu, Juming Xiong, Ruining Deng, Yu Wang, Yaohong Wang, Shilin Zhao, Mengmeng Yin, Yuqing Liu, Haichun Yang, Yuankai Huo

TL;DR

This paper investigates how close AI models can come to replicating Banff lesion scoring in renal allograft pathology by decomposing Banff indicators into structural and inflammatory components and evaluating existing AI tools through a modular, rule-based framework. It builds a pipeline that combines tissue-section detection, structural segmentation (Omni-Seg), and inflammatory cell detection, then maps outputs to Banff scores for g, ptc, and v using defined thresholds. The authors reveal partial success but identify critical failure modes such as structural omissions, hallucinations, detection ambiguity, and interpretability gaps where correct final scores may not reflect robust intermediate reasoning. Overall, the work highlights fundamental challenges in fully replacing expert-grade Banff scoring with current AI methods and offers a modular evaluation framework to guide future development and standardization in transplant pathology.

Abstract

The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator - such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v) - into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines and evaluated against expert-annotated ground truth. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in computationally replicating expert-level grading, and emphasize the importance of modular evaluation and of a computational Banff grading standard in guiding future model development for transplant pathology.
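
As a rough illustration of the rule-based mapping described above, the sketch below assigns a glomerulitis (g) score from per-glomerulus inflammatory cell counts. It assumes the commonly cited Banff g cut-offs (no involved glomeruli, under 25%, 25 to 75%, over 75%); the paper's own equations and thresholds are not reproduced here, and the function name `banff_g_score` and the `min_cells` parameter are hypothetical.

```python
from typing import Sequence


def banff_g_score(cells_per_glomerulus: Sequence[int], min_cells: int = 1) -> int:
    """Map per-glomerulus inflammatory cell counts to a g score (0-3).

    cells_per_glomerulus: number of detected inflammatory cells that fall
        inside each segmented glomerulus (one entry per glomerulus).
    min_cells: assumed minimum count for a glomerulus to be treated as
        involved; this cut-off is an illustrative assumption.
    """
    if not cells_per_glomerulus:
        # No segmented glomeruli: the score is undefined, not zero.
        raise ValueError("no glomeruli segmented; g cannot be scored")

    involved = sum(1 for n in cells_per_glomerulus if n >= min_cells)
    fraction = involved / len(cells_per_glomerulus)

    if involved == 0:
        return 0          # g0: no glomerulitis
    if fraction < 0.25:
        return 1          # g1: fewer than 25% of glomeruli involved
    if fraction <= 0.75:
        return 2          # g2: 25% to 75% of glomeruli involved
    return 3              # g3: more than 75% of glomeruli involved


# One involved glomerulus out of five gives a fraction of 0.2, hence g1.
example_score = banff_g_score([3, 0, 0, 0, 0])
```

Raising an error when no glomeruli are segmented, rather than returning 0, keeps a structural omission from silently reading as a negative finding.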

Paper Structure

This paper contains 11 sections, 9 equations, 5 figures.

Figures (5)

  • Figure 1: Component-level decomposition of Banff lesion scores and model support status. Each Banff indicator (e.g., g, v, ptc) is represented as a combination of required cellular and structural features. The left panel shows visual examples of representative tissue components segmented by existing models. The right panel maps each Banff score to its constituent elements and indicates whether current models in our pipeline can detect them: ✓ indicates sufficient support, ✗ indicates missing capability, and ? indicates partial or uncertain feasibility. This mapping reveals which Banff scores are currently tractable using existing tools and highlights unresolved challenges, especially for indicators like i, ci, ct, and cg.
  • Figure 2: Visual workflow for Banff lesion score computation. Each row illustrates the process of combining structural segmentation and inflammatory cell detection to derive the Banff score for (a) glomerulitis, (b) peritubular capillaritis, and (c) intimal arteritis. The final scores are assigned based on the corresponding rules and thresholds defined in the scoring equations for g, ptc, and v (labeled eq:g-score, eq:ptc-score, and eq:v-score).
  • Figure 3: Confusion matrices for model-predicted Banff scores versus ground truth across three lesion types: (a) glomerulitis ($g$), (b) peritubular capillaritis ($ptc$), and (c) intimal arteritis ($v$). Color intensity reflects prediction counts, with perfect agreement on the diagonal.
  • Figure 4: Representative structural segmentation errors affecting Banff lesion scoring. (a) Omission errors cause false negatives in ptc, while (b) hallucinations cause false positives in v. Both highlight how segmentation inaccuracies propagate to downstream Banff scores.
  • Figure 5: A representative case where the predicted glomerulitis score ($g=1$) matches the ground truth, yet the inflammatory cell detection results remain ambiguous. Left: model-predicted structural segmentation and inflammatory cells overlaid on the tissue section. Middle: zoomed-in glomerulus with annotated true positives (TP), false positives (FP), and false negatives (FN). Right: expert-annotated ground truth image. Although the overall score is correct, the segmentation and localization of inflammatory cells are not precisely aligned with the expert reference. This illustrates the inherent difficulty of reliably computing semi-quantitative Banff scores when intermediate representations—such as inflammatory cell maps—are themselves uncertain or ill-defined.
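
The gap described in Figure 5, between a correct final score and unreliable intermediate detections, can be made concrete with a small matching routine. The sketch below greedily pairs predicted cell centroids with expert-annotated centroids within a distance tolerance and counts TP, FP, and FN. The function name `match_detections`, the `radius` tolerance, and the coordinates are illustrative assumptions, not the authors' evaluation protocol.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def match_detections(pred: List[Point], truth: List[Point],
                     radius: float = 10.0) -> Tuple[int, int, int]:
    """Greedily match predicted centroids to ground-truth centroids.

    A prediction is a true positive if an unmatched ground-truth point
    lies within `radius` (same units as the coordinates); each
    ground-truth point can be matched at most once.
    Returns (tp, fp, fn).
    """
    unmatched = list(truth)
    tp = 0
    for p in pred:
        best, best_d = None, radius
        for t in unmatched:
            d = math.dist(p, t)
            if d <= best_d:
                best, best_d = t, d
        if best is not None:
            unmatched.remove(best)
            tp += 1
    fp = len(pred) - tp      # predictions with no nearby annotation
    fn = len(unmatched)      # annotations with no nearby prediction
    return tp, fp, fn


# Hypothetical coordinates: the predicted and annotated cell counts are
# both 3, so a count-based score would agree, yet only one cell matches.
pred_cells = [(0.0, 0.0), (50.0, 50.0), (100.0, 100.0)]
true_cells = [(1.0, 1.0), (200.0, 200.0), (300.0, 300.0)]
tp, fp, fn = match_detections(pred_cells, true_cells)
```

In this toy case the cell counts agree while the matching yields one TP, two FPs, and two FNs, which is the kind of score-level agreement without detection-level agreement that the figure illustrates.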