Singpath-VL Technical Report
Zhen Qiu, Kaiwen Xiao, Zhengwei Lu, Xiangyu Liu, Lei Zhao, Hao Zhang
TL;DR
Singpath-VL tackles the data bottleneck in cervical cytology by constructing Singpath-CytoText, a million-scale image–description dataset generated through a three-stage pipeline that fuses outputs from multiple open-source MLLMs and refines them with expert knowledge. The authors then fine-tune Qwen3-VL-4B with a three-stage training recipe (vision–language alignment, instruction-following fine-tuning, and knowledge replay) to produce a domain-specialized cytopathology MLLM. Evaluations on MorphoPercept-Bench and CytoCell-Bench demonstrate superior fine-grained morphological perception and competitive, balanced cell-level classification under The Bethesda System, including perfect classification of the NILM category and improvements in ambiguous categories. The work highlights the value of high-quality synthetic cytology data and careful domain adaptation, and the authors plan to open-source a portion of the dataset to foster reproducibility and further research on AI assistants for cytopathology.
Abstract
We present Singpath-VL, a large vision-language model designed to address the lack of AI assistants for cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly advanced the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline uses multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.
