Skip and Skip: Segmenting Medical Images with Prompts
Jiawei Chen, Dingkang Yang, Yuxuan Lei, Lihua Zhang
TL;DR
This paper addresses medical image lesion segmentation under limited pixel-level annotations by using image-level labels as coarse-grained prompts. It introduces SKS (Skip and Skip), a dual U-shaped two-stage architecture in which a coarse-grained feature guidance branch passes pyramid features to a fine-grained segmentation branch through three skips. A three-layer Swin-T v2 pyramid extracts both coarse and fine features, and the skip connections couple information across levels with the fusion rule $F_{fuse} = W \times (f^{i}_{\lambda} \oplus f^{k}_{\lambda}) + b$. On the LITS dataset, SKS achieves a Dice score of $0.549$ with only $35$ finely annotated scans, outperforming the pixel-annotation-only baselines U-Net ($0.489$) and U-Net++ ($0.509$). These results suggest that exploiting existing image-level diagnoses can reduce the labeling burden without sacrificing segmentation accuracy, and the approach may carry over to other medical imaging tasks.
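To make the fusion rule concrete, here is a minimal sketch in plain Python. It assumes that $\oplus$ denotes channel-wise concatenation and that $W$ and $b$ form an ordinary learned linear projection; the summary does not confirm either point, and the function name `fuse`, the toy feature vectors, and the weights are hypothetical values chosen only for illustration.

```python
def fuse(f_coarse, f_fine, weight, bias):
    """Illustrative F_fuse = W x (f_i (+) f_k) + b for one spatial location.

    Assumption: (+) is concatenation of the two feature vectors.
    The names and shapes here are hypothetical, not taken from the paper.
    """
    joined = list(f_coarse) + list(f_fine)  # concatenate the two feature vectors
    if len(bias) != len(weight):
        raise ValueError("bias needs one entry per row of weight")
    if any(len(row) != len(joined) for row in weight):
        raise ValueError("each row of weight must match the concatenated length")
    # Linear projection: one output channel per row of weight.
    return [
        sum(w * x for w, x in zip(row, joined)) + b
        for row, b in zip(weight, bias)
    ]


# Toy inputs: 2 coarse channels and 2 fine channels projected to 2 channels.
f_coarse = [1.0, 2.0]
f_fine = [3.0, 4.0]
weight = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
bias = [0.5, -1.0]

fused = fuse(f_coarse, f_fine, weight, bias)
print(fused)  # [4.5, 2.0]
```

In a real network the same projection would be applied at every spatial position of the feature map, typically as a 1x1 convolution or a linear layer; the per-location version above only shows the arithmetic.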
Abstract
Most medical image lesion segmentation methods rely on accurate, manually produced annotations of the original image for supervised learning. Recently, a series of weakly supervised and unsupervised methods have been proposed to reduce the dependence on pixel-level annotations. However, these methods are still essentially based on pixel-level annotation and ignore the image-level diagnostic results that already accompany the large volume of existing medical images. In this paper, we propose a dual U-shaped two-stage framework that uses image-level labels to prompt the segmentation. In the first stage, we pre-train a classification network with image-level labels; this network provides hierarchical pyramid features and guides the learning of the downstream branch. In the second stage, we feed the hierarchical features from the classification branch into the downstream branch through short and long skip connections and obtain lesion masks under supervision from pixel-level labels. Experiments show that our framework achieves better results than networks trained with pixel-level annotations alone.
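The comparison against pixel-only baselines is reported as a Dice score. As a reminder of what that number measures, the following sketch computes the standard Dice coefficient, $2|P \cap T| / (|P| + |T|)$, for two binary masks. It is a generic illustration, not the authors' evaluation code; the function name `dice_score`, the smoothing term `eps`, and the toy masks are assumptions.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks given as nested lists of 0/1.

    eps is a small smoothing term (an assumption of this sketch) that
    avoids division by zero when both masks are empty.
    """
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must contain the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    return (2 * intersection + eps) / (total + eps)


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

score = dice_score(pred, target)
print(round(score, 4))  # 0.6667
```

A score of 1 means the predicted and annotated masks coincide exactly, and 0 means they share no foreground pixels.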
