Progressive Vision-Language Prompt for Multi-Organ Multi-Class Cell Semantic Segmentation with Single Branch

Qing Zhang; Hang Guo; Siyuan Yang; Qingli Li; Yan Wang

TL;DR

This work presents MONCH, a single-branch network for multi-organ, multi-class cell semantic segmentation that fuses textual cell attributes with multi-grained visual features via a Progressive Vision-Language Prompt Decoder. A coarse-to-fine visual feature extractor (MGFE) supplies the visual features, and a sequence of attention-driven prompts integrates fine-grained textures, topological cues, and textual priors, enabling robust segmentation on the class-imbalanced PanNuke dataset. Empirical results show that MONCH achieves state-of-the-art IoU, FWIoU, and F1 scores across multiple organ types while maintaining efficiency relative to multi-branch approaches. The approach demonstrates the practical potential of vision-language fusion and multi-scale, multi-modal feature integration for cell segmentation in pathology images.

Abstract

Pathological cell semantic segmentation is a fundamental technology in computational pathology, essential for applications such as cancer diagnosis and effective treatment. Because multiple cell types exist across various organs, with only subtle differences in cell size and shape, multi-organ, multi-class cell segmentation is particularly challenging. Most existing methods employ multi-branch frameworks to enhance feature extraction, but these often result in complex architectures. Moreover, relying on visual information alone limits performance in multi-class analysis because of intricate textural details. To address these challenges, we propose a Multi-OrgaN multi-Class cell semantic segmentation method with a single brancH (MONCH) that leverages vision-language input. Specifically, we design a hierarchical feature extraction mechanism that provides coarse-to-fine-grained features, including high-frequency, convolutional, and topological features, for segmenting cells of various shapes. Inspired by the synergy of textual and multi-grained visual features, we introduce a progressive prompt decoder to harmonize multimodal information, integrating features from fine to coarse granularity for better context capture. Extensive experiments on the PanNuke dataset, which has significant class imbalance and subtle variations in cell size and shape, demonstrate that MONCH outperforms state-of-the-art cell segmentation methods and vision-language models. Code and implementations will be made publicly available.
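
The TL;DR reports IoU and FWIoU on a class-imbalanced dataset, so it may help to see how the two metrics differ. The following is a minimal sketch using the standard definitions (per-class IoU from a confusion matrix, and frequency-weighted IoU that weights each class by its share of ground-truth pixels). It is not the authors' evaluation code; the function names and the toy labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Rows index the ground-truth class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def iou_scores(cm: List[List[int]]) -> Dict[str, object]:
    """Return per-class IoU, mean IoU, and frequency-weighted IoU."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    per_class: List[Optional[float]] = []
    fwiou = 0.0
    for c in range(n):
        tp = cm[c][c]
        gt = sum(cm[c])                         # ground-truth pixels of class c
        pred = sum(cm[r][c] for r in range(n))  # pixels predicted as class c
        union = gt + pred - tp
        if union == 0:
            per_class.append(None)              # class absent from both maps
            continue
        iou = tp / union
        per_class.append(iou)
        fwiou += (gt / total) * iou             # weight by class frequency
    valid = [v for v in per_class if v is not None]
    miou = sum(valid) / len(valid) if valid else 0.0
    return {"per_class": per_class, "miou": miou, "fwiou": fwiou}


# Hypothetical flattened label maps: class 0 dominates, class 2 is rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
scores = iou_scores(confusion_matrix(y_true, y_pred, num_classes=3))
print(scores)
```

In this toy case the frequent class is segmented well and the rarer classes poorly, so the frequency-weighted score comes out higher than the unweighted mean. That gap is why the two numbers are usually read together on imbalanced data.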
