CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification

Yang Yang; Xijie Xu; Yixun Zhou; Jie Zheng

TL;DR

This work addresses the challenge of accurate cell instance segmentation with Vision Transformers, whose patch-based downsampling hurts performance on small, densely packed cells. It introduces CellVTA, a CNN-based adapter that injects high-resolution spatial priors into ViT features through cross-attention, preserving the ViT backbone while enhancing fine-grained segmentation. Across CoNIC and PanNuke, CellVTA achieves state-of-the-art mPQ scores (0.538 and 0.506 respectively) and outperforms decoder-only and full fine-tuning baselines, underscoring the adapter's effectiveness. The results demonstrate the potential of pathology foundation models to benefit from targeted spatial priors in cell-level analysis, with the authors providing public code for reproducibility.
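The resolution loss from patch-based tokenization can be made concrete with a little arithmetic. The sketch below counts the tokens a ViT produces for a square image. The helper name `token_grid` and the 256-pixel tile and 16-pixel patch sizes are assumptions chosen for illustration, not values taken from the paper.

```python
def token_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image
    split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


# Assumed values for illustration only: a 256-pixel tile and 16-pixel patches.
side, total = token_grid(256, 16)
print(side, total)  # 16 256
```

Under these assumed sizes, each token summarizes a 16 x 16 pixel region, so any structure smaller than one patch is represented by at most a handful of tokens.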

Abstract

Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, significantly outperforming state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at https://github.com/JieZheng-ShanghaiTech/CellVTA.
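The injection step described above can be sketched as plain single-head cross-attention, with ViT tokens as queries and flattened CNN features as keys and values. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function `inject`, the scaling factor `gamma`, the residual connection, and all shapes are hypothetical, and the released code may use a different attention variant.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inject(vit_tokens, cnn_feats, w_q, w_k, w_v, gamma=1.0):
    """Hypothetical injector: ViT tokens attend to CNN features and the
    attended result is added back to the tokens as a residual."""
    q = vit_tokens @ w_q                              # (N, d) queries from ViT tokens
    k = cnn_feats @ w_k                               # (M, d) keys from CNN features
    v = cnn_feats @ w_v                               # (M, d) values from CNN features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, M) attention weights
    return vit_tokens + gamma * (attn @ v)            # token shape is unchanged


rng = np.random.default_rng(0)
d = 8                                        # assumed embedding width
vit_tokens = rng.normal(size=(16, d))        # assumed 4x4 token grid
cnn_feats = rng.normal(size=(64, d))         # assumed 8x8 feature map, flattened
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out = inject(vit_tokens, cnn_feats, w_q, w_k, w_v)
print(out.shape)  # (16, 8)
```

Because the output keeps the shape of the ViT tokens, a step like this can sit between unmodified transformer blocks, which is consistent with the abstract's claim that the core ViT architecture is preserved. Setting `gamma` to zero in this sketch returns the original tokens unchanged.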
