A Dual-branch Self-supervised Representation Learning Framework for Tumour Segmentation in Whole Slide Images

Hao Wang; Euijoon Ahn; Jinman Kim

TL;DR

This work tackles the annotation bottleneck in whole slide image (WSI) tumour segmentation by introducing DSF-WSI, a dual-branch self-supervised framework that learns from multi-resolution WSIs. A Context-Target Fusion Module (CTFM) and a masked jigsaw pretext task align context and target features across resolutions, while Dense SimSiam Learning (DSL) maximises the similarity between different views of the same WSI. On the BCSS and PAIP2019 datasets, DSF-WSI achieves state-of-the-art segmentation performance in both fine-tuning and semi-supervised settings, outperforming supervised baselines and prior SSL methods. The results indicate that cross-branch communication and multi-resolution representation learning can reduce annotation needs while still producing robust, discriminative features for tumour segmentation. Future work includes extending the framework to more resolutions and to transformer-based architectures.

Abstract

Supervised deep learning methods have achieved considerable success in medical image analysis, owing to the availability of large-scale, well-annotated datasets. However, creating such datasets for whole slide images (WSIs) in histopathology is challenging because of their gigapixel size. In recent years, self-supervised learning (SSL) has emerged as an alternative that reduces the annotation overhead for WSIs, as it does not require labels for training. Existing SSL approaches, however, are not designed to handle multi-resolution WSIs, which limits their ability to learn discriminative image features. In this paper, we propose a Dual-branch SSL Framework for WSI tumour segmentation (DSF-WSI) that can effectively learn image features from multi-resolution WSIs. DSF-WSI connects two branches and jointly learns from low- and high-resolution WSIs in a self-supervised manner. We introduce a novel Context-Target Fusion Module (CTFM) and a masked jigsaw pretext task to align the learnt multi-resolution features. We also design a Dense SimSiam Learning (DSL) strategy that maximises the similarity of different views of WSIs, making the learnt representations more efficient and discriminative. We evaluated our method on two public datasets covering breast and liver cancer segmentation tasks. The experimental results demonstrate that DSF-WSI can effectively extract robust and efficient representations, which we validated in both fine-tuning and semi-supervised settings. Our proposed method achieved better accuracy than other state-of-the-art approaches. Code is available at https://github.com/Dylan-H-Wang/dsf-wsi.
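
The similarity objective that DSL builds on can be illustrated with a minimal sketch of the standard SimSiam loss, the symmetric negative cosine similarity between one view's predictor output and the other view's projector output. This is plain Python for illustration only and is not taken from the authors' repository. The function names are hypothetical, and the per-stage averaging in `dense_loss` is an assumption about how losses from several model stages might be combined. In a real training loop the projector outputs would be wrapped in a stop-gradient, which plain Python cannot express.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simsiam_loss(p1, z2, p2, z1):
    """Symmetric negative cosine similarity, as in SimSiam.

    p1, p2: predictor outputs h(g(f(.))) for view 1 and view 2.
    z1, z2: projector outputs g(f(.)); treated as constants
            (stop-gradient) when training with an autograd library.
    """
    return -0.5 * (cosine_similarity(p1, z2) + cosine_similarity(p2, z1))


def dense_loss(stage_outputs):
    """Average the loss over several model stages.

    Assumption for illustration: each stage contributes equally.
    stage_outputs is a list of (p1, z2, p2, z1) tuples, one per stage.
    """
    losses = [simsiam_loss(*stage) for stage in stage_outputs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    aligned = ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
    orthogonal = ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
    print(simsiam_loss(*aligned))
    print(dense_loss([aligned, orthogonal]))
```

The loss reaches its minimum of -1 when both prediction and projection pairs point in the same direction, so minimising it is equivalent to maximising the similarity of the two views.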

Paper Structure (27 sections, 10 equations, 4 figures, 7 tables)

Introduction
Related Work
Recent SSL Representation Learning Works
Recent WSI Segmentation Works (non SSL)
SSL in WSI Analysis
Method
Preliminaries: Microns Per Pixel
Overview
Contrastive learning method
Multi-resolution SSL pipeline with CTFM
Dense SimSiam learning
Fine-tuning and inference
Experiments and Discussion
Datasets
BCSS dataset
...and 12 more sections

Figures (4)

Figure 1: An example of WSI patches extracted from different resolutions. When the sliding window size is fixed at $224 \times 224$, more patches from the high-resolution WSI are required to cover the same field of view as a low-resolution patch. Specifically, in this example, it is necessary to use four $20\times$ patches to achieve the same field of view as one $10\times$ patch.
Figure 2: a) Overview of DSF-WSI. First, multi-resolution inputs $x$ are transformed by $T$ and $T'$ into two views, $x_1$ and $x_2$. These are fed into the backbone network $f$ and refined by the proposed CTFM to generate context features, target features and cross-branch features. DSL is then applied to maximise the similarity of these three kinds of features at each model stage. b) Backbone architecture and CTFM: two networks with the same structure (which do not share weights) process the two resolutions separately and generate context features and target features. The CTFM then randomly masks and shuffles the target features and concatenates them with the context features to derive cross-branch features. c) DSL: the features from $n$ intermediate model layers, $\{k_1, ..., k_n\}$, are fed into projectors $\{g_1, ..., g_n\}$ and predictors $\{h_1, ..., h_n\}$, and the outputs are compared with the other view's features $\{k_1', ..., k_n'\}$ to maximise the similarity.
Figure 3: A schematic of SimSiam. The same encoder model $f$ processed two views of an image $x$ generated by two different data transformations, $T$ and $T'$. A projector $g$ and predictor $h$ were then applied to obtain corresponding projection embeddings and output vectors. The objective of SimSiam is to maximise the similarity between $h(g(f(T(x))))$ and $g(f(T'(x)))$.
Figure 4: A schematic of HookNet. Skip connections for both branches are omitted for clarity. Feature maps were scaled by a factor of $2$. In this example, the feature maps at depth $2$ in the decoder part of the context branch have the same resolution as the feature maps in the bottleneck of the target branch. Thus, the context and target information were then combined by cropping and concatenation.
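
The field-of-view relationship described in Figure 1 can be checked with a short sketch. With a fixed patch size, doubling the magnification halves the physical width and height covered by a patch, so the number of high-resolution patches needed grows with the square of the magnification ratio. The function names and the coordinate convention (top-left pixel origins, integer magnification ratios) are assumptions made for illustration and are not taken from the paper's code.

```python
def patches_to_cover(low_mag, high_mag):
    """Number of high-magnification patches that tile one
    low-magnification patch of the same pixel size."""
    if high_mag % low_mag != 0:
        raise ValueError("sketch assumes an integer magnification ratio")
    scale = high_mag // low_mag
    return scale * scale


def high_res_patch_origins(x, y, low_mag, high_mag, patch=224):
    """Top-left corners, in high-magnification pixel coordinates, of the
    patches covering the low-magnification patch whose top-left corner
    is (x, y) in low-magnification pixel coordinates."""
    if high_mag % low_mag != 0:
        raise ValueError("sketch assumes an integer magnification ratio")
    scale = high_mag // low_mag
    return [
        (x * scale + col * patch, y * scale + row * patch)
        for row in range(scale)
        for col in range(scale)
    ]


if __name__ == "__main__":
    print(patches_to_cover(10, 20))
    print(high_res_patch_origins(0, 0, 10, 20))
```

For the example in Figure 1, `patches_to_cover(10, 20)` gives 4, matching the four $20\times$ patches needed for one $10\times$ patch of $224 \times 224$ pixels.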
