From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology

Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Junchao Zhu, Haibo Wang, Daniel Reisenbüchler, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Steven Salvatoree, Surya Seshane, Mert R. Sabuncu, Yihe Yang, Ruining Deng

TL;DR

This work addresses the challenge of fine-grained glomerular subtyping under scarce labeled data by evaluating pathology-specialized and general-purpose vision-language models (VLMs) across 0–32 shots with four fine-tuning strategies. The authors introduce a few-shot framework and embedding-space analyses to study image-text alignment and class separability, revealing that pathology-specialized backbones paired with vanilla fine-tuning offer strong gains in low-shot regimes, while increased supervision further improves performance and calibration. They demonstrate that alignment alone is insufficient to guarantee accuracy, underscoring the importance of how adaptation reshapes decision boundaries and positive/negative pair structure. The study provides practical guidance for model selection, annotation budgeting, and interpretable multimodal learning in renal pathology, supported by novel alignment and separability metrics that correlate with downstream performance.

Abstract

Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse disease classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models under this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot count, model architecture and domain knowledge, and adaptation strategy, this study provides guidance for future model selection and training under real clinical data constraints. Our results indicate that pathology-specialized vision-language backbones, when paired with vanilla fine-tuning, are the most effective starting point. Even with only 4–8 labeled examples per glomerular subtype, these models begin to capture distinctions between subtypes and show substantial gains in discrimination and calibration, though additional supervision continues to yield incremental improvements. We also find that the discrimination between positive and negative examples is as important as image-text alignment. Overall, our results show that supervision level and adaptation strategy jointly shape both diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment.
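
The few-shot setting described in the abstract can be illustrated with a minimal sketch of K-shot subset construction: draw exactly K labeled examples per class, with K = 0 giving the zero-shot case. The function name `sample_k_shot`, the seed handling, and the placeholder subtype names are assumptions made for illustration; the paper's actual sampling protocol is not specified here.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed=0):
    """Return sorted indices of a K-shot subset: k examples per class.

    labels: one class label per image patch (hypothetical input format).
    Raises ValueError if any class has fewer than k examples.
    """
    rng = random.Random(seed)  # local RNG so sampling is reproducible

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label, indices in sorted(by_class.items()):
        if len(indices) < k:
            raise ValueError(
                f"class {label!r} has only {len(indices)} examples, need {k}"
            )
        # Sample k distinct indices from this class (empty when k == 0).
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels; these subtype names are placeholders, not the paper's label set.
labels = ["subtype_a"] * 10 + ["subtype_b"] * 10 + ["subtype_c"] * 10
subset = sample_k_shot(labels, k=4, seed=0)
```

Repeating the draw with different seeds gives independent K-shot splits, which is one plausible way to obtain a distribution of scores over several runs.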

Paper Structure

This paper contains 19 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Pipeline overview. We train the image encoder and the text encoder in a few-shot setting, where only a small number of labels are provided for each glomerular subtype. For each shot, patches and their class descriptions are encoded, and the model is updated. As the number of shots increases, the performance improves.
  • Figure 2: (a) Model architectures. We compare contrastive vision-language models with separate image and text encoders trained via pairwise cosine similarity, and a CoCa-style architecture with an additional fusion/decoder branch. (b) Domain-specific training data. We illustrate three sources of paired image-text data: natural images with generic captions, social media posts from Twitter with informal expert commentary, and biomedical figures from PubMed with pathology-style captions. (c) Fine-tuning strategies. We consider (i) vanilla fine-tuning, where all model parameters are updated; (ii) LoRA, which inserts low-rank trainable updates into frozen weight matrices; (iii) adapter tuning, which adds small trainable adapter modules between frozen layers; and (iv) classifier tuning, which keeps the backbone frozen and only trains a classifier head.
  • Figure 3: Complementary roles of alignment and similarity gap. The figure shows how image and text feature vectors change from before to after fine-tuning. Alignment measures whether fine-tuning pulls an image embedding closer to the embedding of its own positive text description. The similarity gap measures whether fine-tuning pushes an image embedding farther away from text embeddings of other (negative) classes. Together, these two effects capture how fine-tuning reduces within-class image-text distance while enlarging across-class distance.
  • Figure 4: ROC curves across models and fine-tuning methods as shot counts increase. We compare two axes, model comparison (fixed strategy) and fine-tuning method comparison (fixed backbone), by plotting true positive rate (TPR) versus false positive rate (FPR) across increasing shot counts.
  • Figure 5: Discrimination across subtypes. This figure summarizes per-class discrimination performance across 10 runs. For each glomerular subtype, we report the distribution of AUC for different VLMs, fine-tuning strategies, and numbers of shots.
  • ...and 2 more figures
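
The two quantities described in the Figure 3 caption can be sketched in a few lines of code. The definitions used here are assumptions chosen for illustration, not the paper's equations: alignment is taken as the mean cosine similarity between each image embedding and the text embedding of its own class, and the similarity gap as the mean difference between that positive similarity and the average similarity to the other classes' text embeddings. The function names and the toy 2-D embeddings are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_and_gap(image_embs, labels, text_embs):
    """Return (alignment, similarity_gap) under the assumed definitions.

    image_embs: one embedding per image.
    labels:     class index of each image.
    text_embs:  one text embedding per class (at least two classes).
    """
    pos_sims, gaps = [], []
    for emb, y in zip(image_embs, labels):
        sims = [cosine(emb, t) for t in text_embs]
        pos = sims[y]  # similarity to the image's own class description
        negs = [s for c, s in enumerate(sims) if c != y]
        pos_sims.append(pos)
        gaps.append(pos - sum(negs) / len(negs))
    n = len(pos_sims)
    return sum(pos_sims) / n, sum(gaps) / n


# Toy example: two classes with orthogonal text embeddings.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
before = [[1.0, 1.0], [1.0, 1.0]]  # images equally close to both texts
after = [[1.0, 0.0], [0.0, 1.0]]   # images moved onto their own class text

align_before, gap_before = alignment_and_gap(before, labels, text_embs)
align_after, gap_after = alignment_and_gap(after, labels, text_embs)
```

In this toy case the "before" embeddings have a fairly high alignment (about 0.71) but a similarity gap of zero, so the positive and negative texts cannot be told apart; the "after" embeddings reach 1.0 on both. This mirrors the point that alignment alone does not guarantee accurate classification.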