Singpath-VL Technical Report

Zhen Qiu, Kaiwen Xiao, Zhengwei Lu, Xiangyu Liu, Lei Zhao, Hao Zhang

TL;DR

Singpath-VL tackles the data bottleneck in cervical cytology by constructing Singpath-CytoText, a million-scale image–description dataset generated through a three-stage pipeline that fuses outputs from multiple open-source MLLMs and refines them with expert knowledge. The authors then fine-tune Qwen3-VL-4B with a three-stage training recipe (vision–language alignment, instruction-following fine-tuning, and knowledge replay) to produce a domain-specialized cytopathology MLLM. Evaluations on MorphoPercept-Bench and CytoCell-Bench show superior fine-grained morphological perception and competitive, balanced cell-level classification under The Bethesda System, including perfect classification of NILM samples and improvements in ambiguous categories. The work highlights the value of high-quality synthetic cytology data and careful domain adaptation, and the authors plan to open-source a portion of the dataset to foster reproducibility and further research in cytopathology AI assistants.

Abstract

We present Singpath-VL, a large vision-language model that addresses the lack of an AI assistant for cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.
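
The abstract does not specify how consensus fusion is implemented. As a rough illustration only, the sketch below assumes each weak annotator returns a dictionary of morphological attributes and fuses them by majority vote, leaving attributes without a clear majority unset so that an expert can resolve them. The function name `fuse_by_consensus`, the attribute names, and the `min_agreement` threshold are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def fuse_by_consensus(annotations, min_agreement=0.5):
    """Fuse per-annotator attribute dicts by majority vote (hypothetical sketch).

    annotations: list of dicts mapping attribute name -> value, one per annotator.
    An attribute is accepted only if the share of ALL annotators agreeing on its
    most common value is strictly greater than min_agreement; otherwise it is
    set to None to mark it for expert review.
    """
    if not annotations:
        return {}
    fused = {}
    attributes = {name for ann in annotations for name in ann}
    for name in sorted(attributes):
        votes = Counter(ann[name] for ann in annotations if name in ann)
        value, count = votes.most_common(1)[0]
        fused[name] = value if count / len(annotations) > min_agreement else None
    return fused


# Illustrative inputs; the attribute names and values are made up.
weak_outputs = [
    {"nuclear_size": "enlarged", "chromatin": "coarse"},
    {"nuclear_size": "enlarged", "chromatin": "fine"},
    {"nuclear_size": "enlarged"},
]
fused = fuse_by_consensus(weak_outputs)
```

Here all three annotators agree on `nuclear_size`, so it is kept, while `chromatin` has no majority and is left as `None`. That is the point at which an expert knowledge step could be applied under this assumed design.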

Paper Structure (18 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of the multi-stage training recipe for Singpath-VL. The pipeline consists of three consecutive stages: (1) Vision-Language Alignment for domain grounding, (2) Supervised Fine-Tuning for instruction following, and (3) Knowledge Replay for mitigating catastrophic forgetting, resulting in a specialized and robust cervical cytology MLLM.
  • Figure 2: Comparison of average performance scores (%) on MorphoPercept-Bench. This bar chart aggregates the mean values across all key morphological observations.
  • Figure 3: Comparison of average performance scores (%) on CytoCell-Bench. This bar chart aggregates the mean values across 6 diagnostic categories.
  • Figure 4: Qualitative comparison of responses from three vision-language models (Singpath-VL, Qwen3-VL-4B, and InternVL-3.5-38B from left to right) on a sample of Negative for Intraepithelial Lesion or Malignancy (NILM).
  • Figure 5: Qualitative comparison of responses from three vision-language models (Singpath-VL, Qwen3-VL-4B, and InternVL-3.5-38B from left to right) on a sample of Low-Grade Squamous Intraepithelial Lesion (LSIL).
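
The captions for Figures 2 and 3 describe each bar as a mean across per-observation or per-category scores. A minimal sketch of that unweighted (macro) average is shown below. The function name `macro_average` is hypothetical and the scores are placeholder values for illustration, not results from the paper.

```python
def macro_average(per_category_scores):
    """Return the unweighted mean of per-category scores (in %)."""
    if not per_category_scores:
        raise ValueError("at least one category score is required")
    return sum(per_category_scores.values()) / len(per_category_scores)


# Placeholder scores for two of the categories named in the figure captions.
example_scores = {"NILM": 80.0, "LSIL": 60.0}
average = macro_average(example_scores)
```

Because every category carries equal weight, a rare category influences the reported average as much as a common one, which is consistent with the balanced classification the summary emphasizes.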