When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation
Ziyang Wang, Tianze Li, Jian-Qing Zheng, Baoru Huang
TL;DR
This work tackles semi-supervised multi-class medical image segmentation under limited annotations by introducing S4CVnet, a dual-view framework that jointly leverages CNN and Vision Transformer (ViT) architectures. A feature-learning module lets the two views generate pseudo labels for mutual supervision, while a robust guidance module, built from an Exponential Moving Average (EMA) of network weights, provides consistency-aware supervision. The method achieves state-of-the-art performance on an MRI ventricle dataset across multiple metrics, supported by extensive ablations and a topological exploration of supervision modes that maps the landscape of CNN/ViT semi-supervised strategies. Experiments show gains under varying labeled-data regimes, and the code is released for reproducibility.
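The EMA weight averaging behind the guidance module can be illustrated with a minimal sketch. The function name `ema_update`, the dictionary representation of weights, and the decay value `alpha` are assumptions made for illustration; they are not taken from the S4CVnet code.

```python
def ema_update(guidance, learner, alpha=0.99):
    """Blend learner weights into guidance weights.

    Hypothetical helper: new = alpha * guidance + (1 - alpha) * learner,
    applied independently to every named parameter.
    """
    return {
        name: alpha * guidance[name] + (1.0 - alpha) * learner[name]
        for name in guidance
    }


# Toy scalar "weights"; a real network would hold tensors instead.
guidance = {"w": 0.0, "b": 1.0}
learner = {"w": 1.0, "b": 1.0}

# alpha=0.5 is chosen only so the drift is easy to follow by hand.
for _ in range(3):
    guidance = ema_update(guidance, learner, alpha=0.5)
```

In this toy run the guidance weight `w` moves toward the learner's value in halving steps, so the guidance model changes more smoothly than the network it tracks.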
Abstract
Owing to the scarcity of high-quality annotations in the medical imaging community, semi-supervised learning methods are highly valued in image semantic segmentation tasks. In this paper, an advanced consistency-aware, pseudo-label-based self-ensembling approach is presented to fully utilize the power of the Vision Transformer (ViT) and the Convolutional Neural Network (CNN) in semi-supervised learning. The proposed framework consists of a feature-learning module, in which ViT and CNN mutually enhance each other, and a guidance module that is robust for consistency-aware purposes. In the feature-learning module, the CNN and ViT views each infer pseudo labels recurrently and separately; these pseudo labels expand the data set and benefit the other view. Meanwhile, a perturbation scheme is designed for the feature-learning module, and network weight averaging is used to build the guidance module. In this way, the framework combines the feature-learning strengths of CNN and ViT, strengthens performance via dual-view co-training, and enables consistency-aware supervision in a semi-supervised manner. A topological exploration of all alternative supervision modes with CNN and ViT is validated in detail, demonstrating the most promising performance and the specific setting of our method on semi-supervised medical image segmentation tasks. Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set across a variety of metrics. The code is publicly available.
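The mutual pseudo-label supervision between the two views can be sketched as follows. This is an illustrative toy, not the released implementation: the function names, the use of plain cross-entropy as the loss, and the example probabilities are all assumptions, and the abstract does not specify the loss actually used.

```python
import math


def argmax(probs):
    """Index of the most probable class for one pixel."""
    return max(range(len(probs)), key=lambda k: probs[k])


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of `label` under `probs` (assumed loss)."""
    return -math.log(probs[label] + eps)


def mutual_pseudo_label_loss(probs_a, probs_b):
    """Each view is supervised by the other view's hard pseudo labels.

    probs_a, probs_b: per-pixel class-probability lists from two views.
    In a real training loop the pseudo labels would be treated as
    constants (no gradient flows through them).
    """
    pseudo_a = [argmax(p) for p in probs_a]
    pseudo_b = [argmax(p) for p in probs_b]
    loss_a = sum(cross_entropy(p, y) for p, y in zip(probs_a, pseudo_b)) / len(probs_a)
    loss_b = sum(cross_entropy(p, y) for p, y in zip(probs_b, pseudo_a)) / len(probs_b)
    return loss_a + loss_b, pseudo_a, pseudo_b


# Two unlabeled "pixels", three classes; values are made up for illustration.
cnn_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
vit_probs = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
loss, pseudo_cnn, pseudo_vit = mutual_pseudo_label_loss(cnn_probs, vit_probs)
```

The two views agree on the first pixel and disagree on the second, so most of the loss comes from the second pixel, which is where each view receives a corrective signal from the other.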
