When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation

Ziyang Wang; Tianze Li; Jian-Qing Zheng; Baoru Huang

TL;DR

This work tackles semi-supervised multi-class medical image segmentation under limited annotations by introducing S4CVnet, a dual-view framework that combines CNN and Vision Transformer (ViT) architectures. In the feature-learning module, the CNN view and the ViT view each generate pseudo labels that supervise the other, while a guidance module built from an Exponential Moving Average (EMA) of network weights provides consistency-aware supervision. The method reports state-of-the-art performance on a public cardiac MRI benchmark across multiple metrics, including under varying ratios of labeled to total training data. Ablations and a topological exploration of supervision modes map out the alternative CNN/ViT semi-supervised configurations. The code is publicly available.
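The EMA-based guidance idea can be illustrated with a minimal sketch. It assumes weights are stored as plain lists of floats keyed by parameter name; the names `ema_update`, `teacher`, `student`, and the decay `alpha` are hypothetical and not taken from the paper, and the summary does not say which network's weights are averaged or which decay value is used.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def ema_update(teacher: Weights, student: Weights, alpha: float = 0.99) -> Weights:
    """Return averaged weights: alpha * teacher + (1 - alpha) * student.

    `alpha` is an assumed decay value, not one reported by the authors.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: [
            alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


if __name__ == "__main__":
    teacher = {"w": [0.0, 1.0]}
    student = {"w": [1.0, 1.0]}
    # One update step moves the teacher a small fraction toward the student.
    print(ema_update(teacher, student, alpha=0.9))
```

Because the averaged network changes slowly, its predictions are typically more stable than those of the network being trained, which is the usual motivation for using it as a consistency target.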

Abstract

Due to the lack of quality annotations in the medical imaging community, semi-supervised learning methods are highly valued in image semantic segmentation tasks. In this paper, an advanced consistency-aware pseudo-label-based self-ensembling approach is presented to fully utilize the power of Vision Transformer (ViT) and Convolutional Neural Network (CNN) in semi-supervised learning. Our proposed framework consists of a feature-learning module, in which ViT and CNN mutually enhance each other, and a guidance module that is robust for consistency-aware purposes. In the feature-learning module, pseudo labels are inferred and utilized recurrently and separately by the CNN and ViT views to expand the data set, so that the two views benefit each other. Meanwhile, a perturbation scheme is designed for the feature-learning module, and network weight averaging is utilized to develop the guidance module. By doing so, the framework combines the feature-learning strengths of CNN and ViT, strengthens performance via dual-view co-training, and enables consistency-aware supervision in a semi-supervised manner. A topological exploration of all alternative supervision modes with CNN and ViT is validated in detail, demonstrating the most promising performance and the specific setting of our method on semi-supervised medical image segmentation tasks. Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set with a variety of metrics. The code is publicly available.
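The mutual pseudo-label supervision between the two views can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each view is assumed to output per-pixel class probabilities, pseudo labels are taken as the argmax, and cross-entropy stands in for whatever loss the paper actually uses. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Probs = Sequence[Sequence[float]]  # one probability vector per pixel


def argmax_labels(probs: Probs) -> List[int]:
    """Hard pseudo labels: the most probable class for each pixel."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def cross_entropy(probs: Probs, labels: Sequence[int], eps: float = 1e-12) -> float:
    """Mean negative log-probability assigned to the given labels."""
    total = sum(-math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return total / len(labels)


def cross_pseudo_label_loss(cnn_probs: Probs, vit_probs: Probs) -> float:
    """Each view is penalized against the other view's pseudo labels."""
    cnn_pseudo = argmax_labels(cnn_probs)
    vit_pseudo = argmax_labels(vit_probs)
    return cross_entropy(cnn_probs, vit_pseudo) + cross_entropy(vit_probs, cnn_pseudo)


if __name__ == "__main__":
    cnn = [[0.9, 0.1], [0.2, 0.8]]
    vit = [[0.6, 0.4], [0.7, 0.3]]
    # The views disagree on the second pixel, so the loss is clearly positive.
    print(cross_pseudo_label_loss(cnn, vit))
```

In a training framework the pseudo labels would normally be treated as constants, so that gradients flow only through the view being supervised; that detail is omitted here because the sketch has no automatic differentiation.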

Paper Structure (22 sections, 11 equations, 6 figures, 4 tables)

Introduction
Related Work
Methodology
CNN & ViT
Feature-Learning Module
Guidance Module
Objective
Experiments and Results
Data set
Implementation Details
Backbone
Baseline Methods
Evaluation Measures
Qualitative Results
Quantitative Results
...and 7 more sections

Figures (6)

Figure 1: Example 2-Model-Based, 3-Model-Based, and 4-Model-Based SSL Frameworks for Image Segmentation. The supervision mechanism is illustrated by minimizing the difference (also known as $Loss$) between prediction and (pseudo) label. (a) The best 2-model-based SSL framework [luo2021semi]. (b) The pure ViT-based student-teacher style SSL framework. (c) The best 3-model-based SSL framework, i.e. S4CVnet. (d) The 4-model-based, 5-model-based SSL framework.
Figure 2: The Backbone Segmentation Network. (a, c) A U-shape CNN-based or ViT-based encoder-decoder style segmentation network. (b, d) A pure CNN-based or ViT-based network block. These two network blocks can be directly applied to the U-shape encoder-decoder network, resulting in a purely CNN-based or ViT-based segmentation network.
Figure 3: Sample Qualitative Results on MRI Cardiac Test Set. Yellow, Red, Green, and Black Indicate True Positive, False Positive, False Negative, and True Negative of Each Pixel.
Figure 4: The Performance of S4CVnet Against Other Baseline Methods. (a) The line chart of mIOU results on the test set with different assumptions of the ratio of label/total data for training. (b) The histogram chart indicates the cumulative distribution of IOU performance of the predicted image on the test set.
Figure 5: The Topological Exploration of the Network (CNN & ViT) and the Semi-Supervised Supervision Mode (Student-Teacher Style & Pseudo-Label).
...and 1 more figures
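
The mIOU and cumulative IOU distribution mentioned in the Figure 4 caption can be computed along the following lines. This is a minimal sketch on flattened label lists; the function names are hypothetical, and skipping classes that are absent from both prediction and ground truth is an assumption about the averaging convention, not something stated in the paper.

```python
from typing import Dict, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> Dict[int, float]:
    """Intersection over union for each class present in pred or target."""
    ious: Dict[int, float] = {}
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious[c] = inter / union
    return ious


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average of the per-class IoU values (mIOU) for one image."""
    ious = per_class_iou(pred, target, num_classes)
    return sum(ious.values()) / len(ious) if ious else float("nan")


def cumulative_fraction(scores: Sequence[float], threshold: float) -> float:
    """Fraction of images whose IoU is at or below the threshold."""
    return sum(1 for s in scores if s <= threshold) / len(scores)


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(per_class_iou(pred, target, num_classes=2))
    print(mean_iou(pred, target, num_classes=2))
```

Evaluating `cumulative_fraction` over a grid of thresholds gives the kind of cumulative distribution curve described in panel (b).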