CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification

Yang Yang, Xijie Xu, Yixun Zhou, Jie Zheng

TL;DR

This work addresses the challenge of accurate cell instance segmentation with Vision Transformers, whose patch-based downsampling hurts performance on small, densely packed cells. It introduces CellVTA, a CNN-based adapter that injects high-resolution spatial priors into ViT features through cross-attention, preserving the ViT backbone while enhancing fine-grained segmentation. Across CoNIC and PanNuke, CellVTA achieves state-of-the-art mPQ scores (0.538 and 0.506 respectively) and outperforms decoder-only and full fine-tuning baselines, underscoring the adapter's effectiveness. The results demonstrate the potential of pathology foundation models to benefit from targeted spatial priors in cell-level analysis, with the authors providing public code for reproducibility.

Abstract

Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at https://github.com/JieZheng-ShanghaiTech/CellVTA.
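The resolution loss from tokenization can be made concrete with a little arithmetic. The sketch below is illustrative only: the 256-pixel input and 16-pixel patch size are assumed values chosen for the example, not settings taken from the abstract, and `token_grid` is a hypothetical helper name.

```python
def token_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# Assumed example values: a 256x256 input and 16x16 patches.
side, total = token_grid(256, 16)
pixels_per_token = 16 * 16

print(f"{side}x{side} grid, {total} tokens, {pixels_per_token} pixels per token")
```

Under these assumptions each token summarizes a 16 by 16 pixel region, so a cell only a few pixels wide can fall entirely inside one token, which is the kind of fine detail a high-resolution adapter branch is meant to recover.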

Paper Structure

This paper contains 10 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Training loss and validation mPQ of CellViT under different magnifications. We apply 2$\times$ downsampling and 2$\times$ upsampling to generate the 10$\times$ and 40$\times$ input images, respectively.
  • Figure 2: Overall architecture of CellVTA. It comprises (1) a ViT encoder, (2) an adapter module, and (3) a multi-branch decoder. First, the ViT encoder extracts features from an input image. The adapter module then extracts multi-scale spatial information from the input image and injects it into the ViT encoder via feature interaction. The outputs of the adapter are passed to the decoder via skip connections for cell segmentation.
  • Figure 3: Detailed architecture of the adapter module. The upper branch is the ViT encoder, which is divided into $N$ equal blocks ($N = 4$ in this paper) for feature interaction. The lower branch is the adapter module, consisting of (1) a spatial prior module that extracts high-resolution spatial features from input images, (2) a spatial feature injector that injects spatial priors into the ViT, and (3) a multi-scale feature extractor that extracts hierarchical information from the ViT features.
  • Figure 4: Example of CoNIC and PanNuke patches with ground-truth annotations (left) and CellVTA predictions overlaid (right).
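
The feature injection step described in the Figure 3 caption can be sketched as a single-head cross-attention in which ViT tokens act as queries and the adapter's spatial features act as keys and values. This is a minimal sketch under stated assumptions, not the authors' implementation: it omits the learned projections and normalization a real module would have, and the residual `gate` scalar and the names `cross_attention` and `inject_spatial_prior` are hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return attended


def inject_spatial_prior(vit_tokens, spatial_tokens, gate=0.0):
    """Add gated cross-attention output to the ViT tokens (residual update).

    Assumption: `gate` is an illustrative scalar; with gate=0.0 the ViT
    tokens pass through unchanged, so the pretrained features are preserved.
    """
    attended = cross_attention(vit_tokens, spatial_tokens, spatial_tokens)
    return [
        [t + gate * a for t, a in zip(tok, att)]
        for tok, att in zip(vit_tokens, attended)
    ]


# Toy data: 2 ViT tokens and 3 higher-resolution spatial tokens, 2 channels each.
vit_tokens = [[1.0, 0.0], [0.0, 1.0]]
spatial_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unchanged = inject_spatial_prior(vit_tokens, spatial_tokens, gate=0.0)
updated = inject_spatial_prior(vit_tokens, spatial_tokens, gate=1.0)
```

Because the update is additive, the token count and channel width of the ViT stream stay the same, which is consistent with the claim that the ViT's core architecture is preserved.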