
CellViT: Vision Transformers for Precise Cell Segmentation and Classification

Fabian Hörst, Moritz Rempe, Lukas Heine, Constantin Seibold, Julius Keyl, Giulia Baldini, Selma Ugurel, Jens Siveke, Barbara Grünwald, Jan Egger, Jens Kleesiek

TL;DR

CellViT introduces a nucleus segmentation framework built on a Vision Transformer (ViT) encoder that unifies detection, segmentation, and classification within a U-Net–like architecture. By using a ViT pre-trained on large-scale histology data (ViT256) and the Segment Anything Model (SAM) as backbones, together with multi-task losses and HoVer-Net–style postprocessing, it achieves state-of-the-art nuclei detection and competitive segmentation on PanNuke, and it generalizes to MoNuSeg without finetuning. The model also yields informative nucleus embeddings that can support downstream predictive tasks, and its inference on 1024×1024 patches substantially speeds up whole-slide image (WSI) processing compared with traditional CNN-based methods. Overall, CellViT demonstrates the value of transformer-based backbones for precise cell-level analysis in digital pathology and enables efficient, scalable analysis of gigapixel WSIs.

Abstract

Nuclei detection and segmentation in hematoxylin and eosin-stained (H&E) tissue images are important clinical tasks and crucial for a wide range of applications. However, they are challenging due to variations in nuclei staining and size, overlapping boundaries, and nuclei clustering. While convolutional neural networks have been extensively used for this task, we explore the potential of Transformer-based networks in this domain. Therefore, we introduce a new method for automated instance segmentation of cell nuclei in digitized tissue samples using a deep learning architecture based on Vision Transformer called CellViT. CellViT is trained and evaluated on the PanNuke dataset, which is one of the most challenging nuclei instance segmentation datasets, consisting of nearly 200,000 nuclei annotated into 5 clinically important classes across 19 tissue types. We demonstrate the superiority of large-scale in-domain and out-of-domain pre-trained Vision Transformers by leveraging the recently published Segment Anything Model and a ViT-encoder pre-trained on 104 million histological image patches, achieving state-of-the-art nuclei detection and instance segmentation performance on the PanNuke dataset with a mean panoptic quality of 0.50 and an F1-detection score of 0.83. The code is publicly available at https://github.com/TIO-IKIM/CellViT
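The abstract reports results as panoptic quality (PQ). As a rough illustration, the sketch below computes PQ for a single pair of instance label maps using the commonly used definition: detection quality (DQ) times segmentation quality (SQ), with a predicted instance counted as a match when its IoU with a ground-truth instance exceeds 0.5. The function name `panoptic_quality` and the toy arrays are hypothetical, and the sketch does not reproduce the per-class and per-tissue averaging behind the paper's mean PQ.

```python
import numpy as np


def panoptic_quality(gt, pred, iou_threshold=0.5):
    """Compute (PQ, DQ, SQ) for two 2D integer label maps (0 = background).

    Hypothetical helper for illustration only; not the authors' evaluation code.
    """
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pred_ids = [i for i in np.unique(pred) if i != 0]

    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for g in gt_ids:
        g_mask = gt == g
        # Only predicted instances that overlap this ground-truth instance
        # can possibly match it.
        for p in np.unique(pred[g_mask]):
            if p == 0 or p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            # With a threshold of 0.5, each instance has at most one match.
            if iou > iou_threshold:
                matched_pred.add(p)
                iou_sum += iou
                tp += 1
                break

    fp = len(pred_ids) - tp  # predictions without a matching ground truth
    fn = len(gt_ids) - tp    # ground-truth instances that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    dq = tp / denom if denom > 0 else 0.0
    sq = iou_sum / tp if tp > 0 else 0.0
    return dq * sq, dq, sq


# Toy example: one perfectly matched nucleus, one under-segmented nucleus.
gt = np.zeros((4, 4), dtype=int)
gt[0:2, 0:2] = 1
gt[2:4, 2:4] = 2
pred = np.zeros((4, 4), dtype=int)
pred[0:2, 0:2] = 1
pred[2:4, 3:4] = 5  # IoU with ground-truth instance 2 is exactly 0.5: no match
pq, dq, sq = panoptic_quality(gt, pred)
```

In this toy case there is one true positive, one false positive, and one false negative, so DQ is 0.5, SQ is 1.0, and PQ is 0.5.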

Paper Structure

This paper contains 36 sections, 14 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Network structure of CellViT. An input image is transformed into a sequence of tokens (flattened input sections). By using skip connections at multiple encoder depth levels and a dedicated upsampling decoder network, precise nuclei instance segmentations are derived. Nuclei embeddings are extracted from the Transformer encoder.
  • Figure 2: Network structure of our proposed CellViT-network consisting of a ViT encoder connected to multiple decoders via skip connections. Postprocessing is used to separate overlapping nuclei and perform nuclei type classification. For visualization purposes, the tissue classification branch is not illustrated. As encoder networks, we used the pre-trained $\text{ViT}_{256}$ and SAM models.
  • Figure 3: PanNuke nuclei distribution overview for each of the nineteen tissue types, sorted by the total number of nuclei inside the tissue. The total number of nuclei within a tissue type is given in parentheses. Adapted from the PanNuke dataset publication.
  • Figure 4: Example of PanNuke patches with ground-truth annotations and CellViT-SAM-H predictions overlaid for each tissue type.
  • Figure 5: Two-dimensional UMAP embedding visualization (left) of the CoNSeP dataset with the CellViT-SAM-H and $\text{CellViT}_{256}$ (HoVer-Net encoder) models trained on PanNuke. We extract cell-tokens for each detected cell with our model, resulting in one embedding vector per cell. On the right side of the figure, representative clusters derived with the CellViT-SAM-H model are displayed alongside corresponding tissue images. The color overlay illustrates the ground-truth nuclei types within the dataset.
  • ...and 2 more figures
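The caption of Figure 1 describes the input image being transformed into a sequence of tokens (flattened input sections). A minimal NumPy sketch of that step follows. The helper name `image_to_tokens`, the 16-pixel patch size, and the 256×256 RGB input are assumptions chosen for illustration, and the learned linear projection and positional embeddings that a ViT applies afterwards are omitted.

```python
import numpy as np


def image_to_tokens(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row. Hypothetical helper for illustration.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Assumed input size: a 256x256 RGB patch gives a 16x16 grid of tokens.
image = np.arange(256 * 256 * 3).reshape(256, 256, 3)
tokens = image_to_tokens(image)
print(tokens.shape)  # (256, 768)
```

Each row of `tokens` is one flattened input section; a transformer encoder would embed these rows before applying self-attention.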