Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation
Chengxi Zeng, Xinyu Yang, David Smithard, Majid Mirmehdi, Alberto M Gambaruto, Tilo Burghardt
TL;DR
This work tackles videofluoroscopic swallowing study (VFSS) video segmentation by explicitly leveraging temporal information through a Temporal Context Module and a Swin Transformer-based encoder within a UNet-like architecture, enabling robust spatio-temporal feature learning. The proposed Video-SwinUNet processes short video snippets $x \in \mathbb{R}^{t \times H \times W}$ with a ResNet-50 backbone, fuses temporal context across the $t$ frames, and encodes the fused features with a hierarchical Swin Transformer before decoding with a CNN-based up-sampler. Evaluations on VFSS2022 Part1/Part2 show state-of-the-art Dice scores of $0.8986$/$0.8186$ and competitive HD95 metrics, with ablations validating the temporal blending and transfer-learning capabilities. The results indicate that temporal dynamics improve medical video segmentation and that the model generalizes well across datasets; the code is publicly available.
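To make the idea of fusing temporal context concrete, the following is a minimal sketch of blending per-frame feature vectors into a single vector for the centre frame. It is not the paper's Temporal Context Module: the function name `blend_temporal_features`, the softmax-over-similarity weighting, and the use of flat feature vectors instead of feature maps are all assumptions chosen for illustration.

```python
import math


def blend_temporal_features(frame_features, centre_index=None):
    """Blend t per-frame feature vectors into one vector for the centre frame.

    Illustrative stand-in only (an assumption, not the authors' module):
    each frame is weighted by a softmax over its scaled dot-product
    similarity to the centre frame, then the frames are averaged.
    """
    t = len(frame_features)
    if t == 0:
        raise ValueError("need at least one frame")
    if centre_index is None:
        centre_index = t // 2  # middle frame of the snippet
    centre = frame_features[centre_index]
    dim = len(centre)

    # Scaled dot-product similarity between every frame and the centre frame.
    scores = [
        sum(a * b for a, b in zip(frame, centre)) / math.sqrt(dim)
        for frame in frame_features
    ]

    # Numerically stable softmax over the t similarity scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    # Weighted sum of the frame features, one output value per dimension.
    blended = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return blended, weights
```

Called on a snippet of three frames, the function returns one blended vector and the three weights, which sum to 1; frames that resemble the centre frame contribute more.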
Abstract
This paper presents a deep learning framework for medical video segmentation. Convolutional neural network (CNN) and transformer-based methods have achieved notable milestones in medical image segmentation tasks owing to their strong semantic feature encoding and global information comprehension abilities. However, most existing approaches ignore a salient aspect of medical video data: the temporal dimension. Our proposed framework explicitly extracts features from neighbouring frames across the temporal dimension and incorporates them with a temporal feature blender, which then tokenises the high-level spatio-temporal feature to form a strong global feature encoded via a Swin Transformer. The final segmentation results are produced via a UNet-like encoder-decoder architecture. Our model outperforms other approaches by a significant margin and improves the segmentation benchmarks on the VFSS2022 dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets tested. Our studies also show the efficacy of the temporal feature blending scheme and the cross-dataset transferability of learned capabilities. Code and models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet.
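For readers unfamiliar with the reported metric, the Dice coefficient between a predicted mask $P$ and a ground-truth mask $T$ is $2|P \cap T| / (|P| + |T|)$. The sketch below computes it for binary masks given as nested lists; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions, and the paper's own evaluation code may differ.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks given as nested lists."""
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)

    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction with two foreground pixels, one of which overlaps a single-pixel ground truth, scores $2 \cdot 1 / (2 + 1) \approx 0.667$.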
