Progressive Vision-Language Prompt for Multi-Organ Multi-Class Cell Semantic Segmentation with Single Branch

Qing Zhang, Hang Guo, Siyuan Yang, Qingli Li, Yan Wang

TL;DR

This work presents MONCH, a single-branch network for multi-organ, multi-class cell semantic segmentation that fuses textual cell attributes with multi-grained visual features via a Progressive Vision-Language Prompt Decoder. The method uses a coarse-to-fine visual feature extractor (MGFE) and a sequence of attention-driven prompts to integrate fine-grained textures, topological cues, and textual priors, enabling robust segmentation on imbalanced PanNuke data. Empirical results show MONCH achieves state-of-the-art IoU, FWIoU, and F1 scores across multiple organ types, while maintaining efficiency relative to multi-branch approaches. The approach demonstrates the practical potential of vision-language fusion and multi-scale, multi-modal feature integration for pathology-level segmentation tasks.

Abstract

Pathological cell semantic segmentation is a fundamental technology in computational pathology, essential for applications like cancer diagnosis and effective treatment. Given that multiple cell types exist across various organs, with subtle differences in cell size and shape, multi-organ, multi-class cell segmentation is particularly challenging. Most existing methods employ multi-branch frameworks to enhance feature extraction, but these often result in complex architectures. Moreover, reliance on visual information alone limits performance in multi-class analysis due to intricate textural details. To address these challenges, we propose a Multi-OrgaN multi-Class cell semantic segmentation method with a single brancH (MONCH) that leverages vision-language input. Specifically, we design a hierarchical feature extraction mechanism that provides coarse-to-fine-grained features, including high-frequency, convolutional, and topological features, for segmenting cells of various shapes. Inspired by the synergy of textual and multi-grained visual features, we introduce a progressive prompt decoder to harmonize multimodal information, integrating features from fine to coarse granularity for better context capture. Extensive experiments on the PanNuke dataset, which has significant class imbalance and subtle cell size and shape variations, demonstrate that MONCH outperforms state-of-the-art cell segmentation methods and vision-language models. Code and implementations will be made publicly available.
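
The progressive prompt idea in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, treats the current prompt as the queries, and assumes a fusion order of text, then high-frequency, convolutional, and topological features. The names `cross_attention` and `progressive_prompt_decode` and the toy feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head attention: each query vector attends over the context vectors.

    queries: list of P vectors, context: list of N vectors, all of length d.
    Assumption: no learned Q/K/V projections, plus a residual connection.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def progressive_prompt_decode(text_feats, visual_levels):
    """Start from text features and fold in one visual granularity at a time."""
    prompt = text_feats
    for level in visual_levels:
        # The output of one stage becomes the prompt for the next stage.
        prompt = cross_attention(prompt, level)
    return prompt


# Toy inputs: two text tokens and three visual levels (hypothetical values).
text = [[1.0, 0.0], [0.0, 1.0]]
levels = [
    [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]],  # assumed high-frequency tokens
    [[0.4, 0.4], [0.1, 0.8]],              # assumed convolutional tokens
    [[0.7, 0.2]],                          # assumed topological tokens
]
fused = progressive_prompt_decode(text, levels)
print(len(fused), len(fused[0]))
```

The point of the chain is that each stage sees a prompt already conditioned on the previous modalities, so the fused output keeps the shape of the text prompt while accumulating visual context.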

Paper Structure

This paper contains 19 sections, 9 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of the proposed method. Text feature extraction: Textual features are extracted via a frozen text encoder based on GPT-generated cell attributes. Multi-grained visual feature extraction: Multi-grained visual features are obtained from a pre-trained image encoder and enhanced via specific feature extraction modules. $HF^2EM$ is a high-frequency extraction module, $Conv\ 3 \times 3$ is a convolutional block, and $TSEM$ is a topological structure extraction module. Feature fusion: Multi-scale visual features are integrated using a feature pyramid fusion block. Progressive prompt decoder: Multimodal features are progressively input into the cross-attention module as prompts to lower-level features, harmonizing the discrepancy between multi-grained visual and linguistic features.
  • Figure 2: Progressive Vision-Language Prompt Decoder: Multimodal information, including textual features and multi-grained visual features, progressively serves as queries in multi-head self-attention to harmonize features across fine-coarse-fine granularity.
  • Figure 3: PanNuke Cell Distribution Map. Distribution of each of the 19 organ types and 5 cell types.
  • Figure 4: Visualization of multi-organ, multi-cell semantic segmentation in PanNuke.
  • Figure 5: F1 scores compared against SOTA cell segmentation methods across organ types from PanNuke.
  • ...and 2 more figures
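
The IoU, FWIoU, and F1 scores cited in the summary and in Figure 5 can be illustrated with a small sketch that uses the standard pixel-level definitions computed from a confusion matrix. The paper's exact evaluation protocol is not given here, so treat the per-class F1 below (equivalent to Dice) as an assumption; the function names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Per-class IoU and F1, mean IoU, and frequency-weighted IoU (FWIoU)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    ious, f1s, fwiou = [], [], 0.0
    for c in range(n):
        tp = cm[c][c]
        gt = sum(cm[c])                         # pixels labelled as class c
        pred = sum(cm[r][c] for r in range(n))  # pixels predicted as class c
        union = gt + pred - tp
        iou = tp / union if union else 0.0
        f1 = 2 * tp / (gt + pred) if (gt + pred) else 0.0
        ious.append(iou)
        f1s.append(f1)
        # FWIoU weights each class IoU by its share of ground-truth pixels,
        # so frequent classes dominate the score on imbalanced data.
        if total:
            fwiou += (gt / total) * iou
    return {"iou": ious, "miou": sum(ious) / n, "fwiou": fwiou, "f1": f1s}


# Toy example: six flattened pixels and three classes (hypothetical labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, 3))
print(metrics)
```

On a class-imbalanced dataset, mean IoU and FWIoU can diverge noticeably, which is one reason to report both alongside per-class F1.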