Zero-shot segmentation of skin tumors in whole-slide images with vision-language foundation models

Santiago Moreno, Pablo Meseguer, Rocío del Amor, Valery Naranjo

TL;DR

The paper tackles the challenge of annotating skin neoplasms in gigapixel whole-slide images by introducing ZEUS, a zero-shot segmentation pipeline that leverages frozen vision-language foundation models and class-specific prompt ensembles. By tiling WSIs into patches, extracting visual features, and matching them to text-derived prototypes via cosine similarity, ZEUS generates high-resolution tumor masks without pixel-level annotations. Experiments on two in-house datasets (primary spindle cell neoplasms and cutaneous metastases) demonstrate competitive zero-shot segmentation performance and reveal how prompt design, domain shifts, and institutional variability shape results. The work highlights the potential to reduce annotation burden and enable scalable, explainable tumor delineation in diagnostic workflows, while also pointing to domain adaptation and multi-scale strategies to improve robustness.

Abstract

Accurate annotation of cutaneous neoplasm biopsies represents a major challenge due to their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Vision-language foundation models (VLMs), pre-trained on paired image-text corpora, learn joint representations that bridge visual features and diagnostic terminology, enabling zero-shot localization and classification of tissue regions without pixel-level labels. However, most existing VLM applications in histopathology remain limited to slide-level tasks or rely on coarse interactive prompts, and they struggle to produce fine-grained segmentations across gigapixel whole-slide images (WSIs). In this work, we introduce a zero-shot visual-language segmentation pipeline for whole-slide images (ZEUS), a fully automated, zero-shot segmentation framework that leverages class-specific textual prompt ensembles and frozen VLM encoders to generate high-resolution tumor masks in WSIs. By partitioning each WSI into overlapping patches, extracting visual embeddings, and computing cosine similarities against text prompts, we generate a final segmentation mask. We demonstrate competitive performance on two in-house datasets, primary spindle cell neoplasms and cutaneous metastases, highlighting the influence of prompt design, domain shifts, and institutional variability in VLMs for histopathology. ZEUS markedly reduces annotation burden while offering scalable, explainable tumor delineation for downstream diagnostic workflows.
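
The matching step described above (patch embeddings compared against class-level prompt ensembles by cosine similarity, followed by an arg max) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the embeddings are assumed to come from some frozen vision and text encoder, and averaging the normalized prompt embeddings is only one common choice of ensembling function that the abstract does not confirm.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products equal cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def prompt_ensemble(text_embeddings):
    """Collapse the prompt embeddings of one class, shape (P, D), into one prototype (D,).

    Assumption: the ensemble is the re-normalized mean of the normalized prompts.
    """
    return l2_normalize(l2_normalize(text_embeddings).mean(axis=0))


def zero_shot_patch_labels(patch_embeddings, class_prompt_embeddings):
    """Assign each patch to the class whose prototype is most similar.

    patch_embeddings:        array of shape (N, D), one row per WSI patch
    class_prompt_embeddings: list of C arrays, each of shape (P_c, D)
    Returns (labels, similarities) with shapes (N,) and (N, C).
    """
    prototypes = np.stack([prompt_ensemble(e) for e in class_prompt_embeddings])
    patches = l2_normalize(patch_embeddings)
    similarities = patches @ prototypes.T  # cosine similarity per patch and class
    return similarities.argmax(axis=1), similarities


# Toy 2-D embeddings standing in for real encoder outputs.
prompts = [
    np.array([[1.0, 0.0], [0.9, 0.1]]),  # prompts for class 0
    np.array([[0.0, 1.0], [0.1, 0.9]]),  # prompts for class 1
]
patches = np.array([[2.0, 0.1], [0.2, 3.0], [1.0, 0.2]])
labels, sims = zero_shot_patch_labels(patches, prompts)
```

Placing each patch label back at its grid position in the slide would give a coarse mask; how overlapping patches are merged into the final high-resolution mask is not specified in the abstract and is left out of this sketch.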

Paper Structure

This paper contains 9 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of the ZEUS workflow. Top: a pre-trained vision encoder ($f_V$) extracts features from the WSI patches ($v_N$). Textual prompts for the $C$ classes are encoded with the text encoder ($f_T$) to obtain the text embeddings, which the ensembling function ($f_p$) turns into the prompt ensembles. Bottom: cosine similarities ($s_{j,c}$) between text and patch embeddings yield class-specific similarity maps ($S_C$), which are stacked and passed through a pixel-wise $\arg\max$ to generate the final segmentation mask $\hat{Y}$.
  • Figure 2: Contours of the segmentation masks predicted by the two models, CONCH (blue) and KEEP (red), each with its DSC value, alongside the pathologist annotation (green) on a HUSC leiomyoma WSI.
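
The DSC values in Figure 2 presumably refer to the Dice similarity coefficient, the usual overlap score between a predicted mask and a reference annotation. A minimal sketch for binary masks follows; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the source.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


# Toy masks: the prediction covers two pixels, one of which is annotated as tumor.
pred_mask = np.array([[1, 1], [0, 0]])
true_mask = np.array([[1, 0], [0, 0]])
score = dice_coefficient(pred_mask, true_mask)  # 2 * 1 / (2 + 1)
```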