Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation

Chengxi Zeng; Xinyu Yang; David Smithard; Majid Mirmehdi; Alberto M Gambaruto; Tilo Burghardt

TL;DR

This work tackles VFSS video segmentation by explicitly leveraging temporal information through a Temporal Context Module and a Swin Transformer-based encoder within a UNet-like architecture, enabling robust spatio-temporal feature learning. The proposed Video-SwinUNet processes short video snippets $x \in \mathbb{R}^{t \times H \times W}$ with a ResNet-50 backbone, fuses temporal context, and encodes the fused features with a hierarchical Swin Transformer before decoding with a CNN-based up-sampler. Evaluations on VFSS2022 Part1/Part2 show state-of-the-art Dice scores of $0.8986$/$0.8186$ and competitive HD95 metrics, with ablations validating the temporal blending and transfer-learning capabilities. The work provides strong evidence that temporal dynamics improve medical video segmentation and demonstrates good generalization across datasets, supported by publicly available code.

Abstract

This paper presents a deep learning framework for medical video segmentation. Convolutional neural network (CNN) and transformer-based methods have achieved great milestones in medical image segmentation tasks due to their strong semantic feature encoding and global information comprehension abilities. However, most existing approaches ignore a salient aspect of medical video data: the temporal dimension. Our proposed framework explicitly extracts features from neighbouring frames across the temporal dimension and incorporates them with a temporal feature blender, which then tokenises the high-level spatio-temporal feature to form a strong global feature encoded via a Swin Transformer. The final segmentation results are produced via a UNet-like encoder-decoder architecture. Our model outperforms other approaches by a significant margin and improves the segmentation benchmarks on the VFSS2022 dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets tested. Our studies also show the efficacy of the temporal feature blending scheme and cross-dataset transferability of learned capabilities. Code and models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet.
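
For readers unfamiliar with the headline metric, the Dice coefficient between a predicted mask $P$ and a ground-truth mask $G$ is $2|P \cap G| / (|P| + |G|)$. The following is a minimal, dependency-free sketch of that formula for a single pair of binary masks. It is illustrative only and is not taken from the authors' repository; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions made here.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ G| / (|P| + |G|) for two binary masks given as lists of rows.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    # Flatten both 2-D masks into flat lists of 0/1 values.
    p = [int(bool(v)) for row in pred for v in row]
    g = [int(bool(v)) for row in target for v in row]
    if len(p) != len(g):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels labelled foreground in both masks.
    intersection = sum(a & b for a, b in zip(p, g))
    total = sum(p) + sum(g)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 foreground pixels each, 2 of them shared.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
score = dice_coefficient(pred_mask, true_mask)  # 2*2 / (3+3)
```

In this toy case the score is 4/6, roughly 0.667. Scores such as 0.8986 reported above would typically be averages of this per-mask quantity over a test set, although the exact averaging protocol is not stated in this summary.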

Paper Structure (11 sections, 2 equations, 3 figures, 3 tables)

Introduction and Related Work
Methodology
Architecture Overview
Temporal Context Module
Swin Transformer
Experiments and Results
Datasets and Implementation details
Comparison with the state of the art
Ablation study
Transfer learning
Conclusion

Figures (3)

Figure 1: Video-SwinUNet Architecture Overview. (a) A ResNet-50 CNN feature extractor; (b) Temporal Context Module for temporal feature blending; (c) A Swin Transformer-based feature encoder; (d) Cascaded CNN up-sampler for segmentation reconstruction; (e) 2-layer segmentation head for detailed pixel-wise label mapping. Three skip connections are bridged between the CNN feature extractor and up-sampler as well as from the temporal features.
Figure 2: Qualitative Results. Model segmentation results on 3 consecutive frames selected from the VFSS Part2 test set. All results are shown as instance pairs of bolus and pharynx predictions side by side. The red and blue outlines indicate the output segmentation and ground truth, respectively. (Best viewed zoomed)
Figure 3: Grad-CAM Visualisation. Comparing the two closest competing architectures, Grad-CAM maps show where the model pays attention. Note the cleaner focus of our proposed approach. (Best viewed zoomed)
