Mind the Gap: Evaluating Patch Embeddings from General-Purpose and Histopathology Foundation Models for Cell Segmentation and Classification

Valentina Vadori, Antonella Peruffo, Jean-Marie Graïc, Livio Finos, Enrico Grisan

TL;DR

The paper investigates whether domain-specific histopathology foundation models offer advantages over general-purpose models for cell instance segmentation and cell-type classification within the CISCA encoder-decoder framework. It compares CNN, ViT, and hybrid encoders pretrained on ImageNet-22K/21K or LVD-142M against histopathology-focused ViTs (UNI2, Virchow2, Prov-GigaPath) on the PanNuke, CoNIC, and CytoDArk0 datasets, keeping every encoder frozen (no fine-tuning). General-purpose non-ViT encoders, notably Swin Transformer V2 and ConvNeXt, often surpass the histopathology ViTs in both cell segmentation (CS) and cell classification (CC), which suggests that inductive biases such as locality and hierarchy are important for accurate cell delineation and classification. The results inform model selection for histopathology and brain cytoarchitecture analyses and point to future work on more comprehensive ViT configurations and training strategies to close the representation gap.

Abstract

Recent advancements in foundation models have transformed computer vision, driving significant performance improvements across diverse domains, including digital histopathology. However, the advantages of domain-specific histopathology foundation models over general-purpose models for specialized tasks such as cell analysis remain underexplored. This study investigates the representation learning gap between these two categories by analyzing multi-level patch embeddings applied to cell instance segmentation and classification. We implement an encoder-decoder architecture with a consistent decoder and various encoders. These include convolutional, vision transformer (ViT), and hybrid encoders pre-trained on ImageNet-22K or LVD-142M, representing general-purpose foundation models. These are compared against ViT encoders from the recently released UNI, Virchow2, and Prov-GigaPath foundation models, trained on patches extracted from hundreds of thousands of histopathology whole-slide images. The decoder integrates patch embeddings from different encoder depths via skip connections to generate semantic and distance maps. These maps are then post-processed to create instance segmentation masks where each label corresponds to an individual cell and to perform cell-type classification. All encoders remain frozen during training to assess their pre-trained feature extraction capabilities. Using the PanNuke and CoNIC histopathology datasets, and the newly introduced Nissl-stained CytoDArk0 dataset for brain cytoarchitecture studies, we evaluate instance-level detection, segmentation accuracy, and cell-type classification. This study provides insights into the comparative strengths and limitations of general-purpose vs. histopathology foundation models, offering guidance for model selection in cell-focused histopathology and brain cytoarchitecture analysis workflows.
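The abstract states that the predicted maps are post-processed into an instance mask in which each label corresponds to one cell. The snippet below is a minimal sketch of that idea only, not the CISCA post-processing itself: it assumes the semantic map has already been thresholded into a binary foreground mask and labels each 4-connected region with plain connected-component labelling, ignoring the distance maps. The function name `label_instances` and the toy mask are hypothetical.

```python
from collections import deque


def label_instances(foreground):
    """Give each 4-connected foreground region its own positive integer label.

    `foreground` is a list of rows of 0/1 values (assumed: a thresholded
    semantic map). Returns (label_map, number_of_regions); background stays 0.
    """
    rows, cols = len(foreground), len(foreground[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels that already belong to a region.
            if not foreground[r][c] or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            # Breadth-first flood fill over the 4-neighbourhood.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and foreground[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels, next_label


# Hypothetical toy mask with three separate "cells".
toy_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]
label_map, n_cells = label_instances(toy_mask)
```

Plain connected components merge touching cells into a single label. That is presumably why the architecture also predicts distance maps, although the sketch does not model how they are used.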

Paper Structure

This paper contains 13 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: A schematic representation of the encoder-decoder architecture utilized in this study. Feature maps are extracted from four encoder blocks and processed through convolution and upsampling operations to achieve predefined channel dimensions and resolutions. The decoder, consisting of upsampling and convolutional blocks, is designed to generate four distinct outputs based on the CISCA framework (Vadori et al., 2024). These outputs are further post-processed to produce the final label map and to assign a specific cell type to each detected cell.
  • Figure 2: Loss curves over 100 epochs for different models trained on PanNuke. Dotted and solid lines represent training and validation loss, respectively. The lowest validation loss for each model (whose name is inherited from the encoder used) is marked for reference.
  • Figure 3: Example of cell instance segmentation and classification on two test patches from PanNuke using Swin2-B-22K.
  • Figure 4: Example of cell instance segmentation on two test patches from CytoDArk0 using ConvNeXt-B-22K.
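The Figure 1 caption mentions that feature maps from four encoder blocks are brought to predefined resolutions. A plain ViT keeps the same token grid at every depth, so each level must be resized by a different factor before it can feed a pyramid-style decoder. The arithmetic can be sketched as follows. The target strides (4, 8, 16, 32), the 256-pixel input, and the patch size of 16 are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical.

```python
def token_grid(image_size, patch_size):
    """Side length of the square token grid a ViT produces for a square input."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def resize_factors(image_size, patch_size, target_strides=(4, 8, 16, 32)):
    """Return the token-grid side and, per decoder level, (target_side, scale).

    `scale` is the spatial factor needed to take the single-scale token grid
    to the resolution implied by each (assumed) target stride.
    """
    grid = token_grid(image_size, patch_size)
    levels = []
    for stride in target_strides:
        target_side = image_size // stride
        levels.append((target_side, target_side / grid))
    return grid, levels


# Illustrative values: a 256x256 input and a ViT with 16x16 patches.
grid, levels = resize_factors(256, 16)
for target_side, scale in levels:
    print(f"{grid}x{grid} tokens -> {target_side}x{target_side} (scale {scale})")
```

Under these assumptions the shallowest level is upsampled 4x, the deepest is downsampled 2x, and only the stride-16 level matches the native token grid. Hierarchical encoders such as ConvNeXt or Swin produce multi-scale maps directly and would not need this resizing to the same extent.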