Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks
Lingran Song, Yucheng Zhou, Jianbing Shen
TL;DR
This work defines Medical Diagnosis Segmentation (MDS), a task that jointly produces pixel-level segmentation, a diagnosis, and the supporting reasoning for a medical image. It introduces the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset and the Sim4Seg framework, which uses Region-Aware Vision-Language Similarity to Mask (RVLS2M) to generate region-level prompts for precise segmentation, guided by diagnosis reasoning from a large vision-language model (LVLM). A test-time scaling strategy further boosts performance by sampling multiple chain-of-thought (CoT) paths and prompts and selecting the best resulting mask. Across diverse modalities and datasets, Sim4Seg improves both segmentation quality and diagnostic accuracy over the baselines and generalizes across modalities and datasets. Together, these contributions move toward unified vision-language medical reasoning in which each diagnosis is accompanied by an interpretable segmentation.
Abstract
Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely address segmentation and diagnosis jointly. For patients, however, it is crucial that models provide explainable diagnoses alongside segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries about medical images and generate the corresponding segmentation masks together with diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, which contains diverse multimodal, multi-disease medical images paired with their segmentation masks and diagnosis chain-of-thought, the latter created via an automated generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves diagnosis segmentation performance through the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To further improve performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms the baselines in both segmentation and diagnosis.
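To make the idea behind a vision-language similarity-to-mask step more concrete, the following is a minimal illustrative sketch and not the RVLS2M module itself. It assumes that per-patch image embeddings and a single text embedding are already available, scores each patch by cosine similarity, and turns the high-similarity region into a bounding box plus a peak point that could be passed to a promptable segmenter. The function names `similarity_map` and `region_prompts`, the min-max normalization, the `threshold` value, and the box-plus-point prompt format are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_map(patch_embeddings, text_embedding):
    """Score every patch in a 2D grid of embeddings against the text embedding."""
    return [[cosine(patch, text_embedding) for patch in row]
            for row in patch_embeddings]


def region_prompts(sim, threshold=0.5):
    """Derive hypothetical region-level prompts from a similarity map.

    Returns a dict with a bounding box (top, left, bottom, right) around all
    patches whose min-max normalized similarity is >= threshold, and the
    (row, col) of the most similar patch. Returns None for a flat map.
    """
    flat = [s for row in sim for s in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    if span == 0:
        return None  # no patch stands out, so there is no region to prompt

    rows, cols = [], []
    best = None  # (similarity, row, col) of the peak patch
    for r, row in enumerate(sim):
        for c, s in enumerate(row):
            if (s - lowest) / span >= threshold:
                rows.append(r)
                cols.append(c)
            if best is None or s > best[0]:
                best = (s, r, c)

    box = (min(rows), min(cols), max(rows), max(cols))
    return {"box": box, "point": (best[1], best[2])}


if __name__ == "__main__":
    # Toy 3x3 grid of 2-d patch embeddings; the text embedding points along x.
    grid = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
    grid[1][1] = [1.0, 0.0]  # patch aligned with the text
    grid[1][2] = [1.0, 1.0]  # patch partially aligned with the text
    print(region_prompts(similarity_map(grid, [1.0, 0.0])))
```

In this toy setup the box covers the two patches that resemble the text and the point lands on the best-matching patch. A real pipeline would operate on encoder features at image resolution and would map patch indices back to pixel coordinates.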
