PolypNextLSTM: A lightweight and fast polyp video segmentation network using ConvNext and ConvLSTM

Debayan Bhattacharya, Konrad Reuter, Finn Behrendt, Lennart Maack, Sarah Grube, Alexander Schlaefer

TL;DR

The paper tackles the lack of temporal context in single-image polyp segmentation by introducing PolypNextLSTM, a lightweight video segmentation network. It combines a pruned ConvNext-Tiny encoder with a bidirectional ConvLSTM for temporal fusion within a UNet-like framework, achieving strong segmentation while maintaining edge-friendly parameters. On the SUN-SEG dataset, it outperforms state-of-the-art image and video models, notably achieving higher Dice scores on hard unseen cases and delivering real-time performance. The work demonstrates the practical potential for fast, accurate video polyp segmentation on resource-constrained devices and provides open-source code for broader adoption.

Abstract

Commonly employed in polyp segmentation, single-image UNet architectures lack the temporal insight clinicians gain from video data in diagnosing polyps. To mirror clinical practices more faithfully, our proposed solution, PolypNextLSTM, leverages video-based deep learning, harnessing temporal information for superior segmentation performance with the least parameter overhead, making it possibly suitable for edge devices. PolypNextLSTM employs a UNet-like structure with ConvNext-Tiny as its backbone, strategically omitting the last two layers to reduce parameter overhead. Our temporal fusion module, a Convolutional Long Short Term Memory (ConvLSTM), effectively exploits temporal features. Our primary novelty lies in PolypNextLSTM, which stands out as the leanest in parameters and the fastest model, surpassing the performance of five state-of-the-art image and video-based deep learning models. The evaluation on the SUN-SEG dataset spans easy-to-detect and hard-to-detect polyp scenarios, along with videos containing challenging artefacts like fast motion and occlusion. Comparison against 5 image-based and 5 video-based models demonstrates PolypNextLSTM's superiority, achieving a Dice score of 0.7898 on the hard-to-detect polyp test set, surpassing image-based PraNet (0.7519) and video-based PNSPlusNet (0.7486). Notably, our model excels in videos featuring complex artefacts such as ghosting and occlusion. PolypNextLSTM, integrating pruned ConvNext-Tiny with ConvLSTM for temporal fusion, not only exhibits superior segmentation performance but also maintains the highest frames per second among evaluated models. Code is available at https://github.com/mtec-tuhh/PolypNextLSTM
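
The Dice scores quoted in the abstract measure the overlap between a predicted mask and the ground-truth mask. As a minimal sketch of that metric for a single pair of binary masks, the function below uses plain Python. The name `dice_score`, the flat 0/1 input format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; how the paper aggregates the score over frames and videos is not stated here.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (an illustrative input format, not the paper's data layout).
    """
    pred = list(pred)
    target = list(target)
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 3 predicted pixels, 3 true pixels, 2 of them shared.
prediction = [0, 1, 1, 1, 0, 0]
ground_truth = [0, 0, 1, 1, 1, 0]
score = dice_score(prediction, ground_truth)
```

In the toy example the masks share 2 foreground pixels out of 3 + 3, so the score is 2 * 2 / 6, roughly 0.667.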

Paper Structure (10 sections, 4 figures, 1 table)

This paper contains 10 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: The proposed model. A reduced ConvNext-Tiny is used as the encoder. The information between the encoded frames is fused using a bidirectional ConvLSTM. The decoder is inspired by the UNet. $F$ is the number of consecutive frames being processed simultaneously by the model.
  • Figure 2: Key components of our network: (a) ConvNextBlock - main encoder building block. (b) Bidirectional ConvLSTM - fuses information across frames. (c) DoubleConv module - merges skip connection data with upsampled information. (d) Patch embedding layer - serves as the encoder's input. (e) Downsampling layers and (f) Upsampling layers. (g) Output layer - reduces channel dimension to 1.
  • Figure 3: Example results for cases where our model performed considerably better than other state-of-the-art models. The left four images are from the "Easy Unseen" test set and the right frames are from the "Hard Unseen" test set.
  • Figure 4: Variation of the number of frames for the different test set configurations for four different metrics. The coloured interval spans the minimum and maximum over the cross-validation folds. The black circle marks the highest value of each metric.
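
The bidirectional ConvLSTM named in Figures 1 and 2(b) can be illustrated with a short PyTorch sketch. This is not the authors' implementation: the class names `ConvLSTMCell` and `BiConvLSTM`, the 3x3 gate kernel, the hidden channel count, and the 1x1 convolution that merges the two directions are all assumptions chosen to keep the example small. It only shows the general idea of running a convolutional LSTM over $F$ encoded frames in both temporal directions and fusing the results per frame.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: the LSTM gates are computed by a 2D convolution."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):  # kernel size is assumed
        super().__init__()
        self.hidden_ch = hidden_ch
        # A single convolution yields all four gates at once.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class BiConvLSTM(nn.Module):
    """Runs one ConvLSTM forward and one backward in time, then merges them."""

    def __init__(self, in_ch, hidden_ch):
        super().__init__()
        self.fwd = ConvLSTMCell(in_ch, hidden_ch)
        self.bwd = ConvLSTMCell(in_ch, hidden_ch)
        # Assumed merge: concatenate both directions, project back to in_ch.
        self.merge = nn.Conv2d(2 * hidden_ch, in_ch, kernel_size=1)

    @staticmethod
    def _run(cell, frames):
        batch, _, height, width = frames[0].shape
        h = frames[0].new_zeros(batch, cell.hidden_ch, height, width)
        c = torch.zeros_like(h)
        outputs = []
        for x in frames:
            h, c = cell(x, (h, c))
            outputs.append(h)
        return outputs

    def forward(self, x):
        # x has shape (batch, F, channels, height, width).
        frames = list(x.unbind(dim=1))
        forward_states = self._run(self.fwd, frames)
        # Process the reversed sequence, then restore the original frame order.
        backward_states = self._run(self.bwd, frames[::-1])[::-1]
        fused = [self.merge(torch.cat([hf, hb], dim=1))
                 for hf, hb in zip(forward_states, backward_states)]
        return torch.stack(fused, dim=1)


# Toy usage: 2 clips of F = 5 frames, 8 feature channels, 16 x 16 feature maps.
features = torch.randn(2, 5, 8, 16, 16)
fused = BiConvLSTM(in_ch=8, hidden_ch=4)(features)
```

Because the merge projects back to the input channel count, the fused tensor has the same shape as the encoder features, so each frame's output could be passed on to a per-frame decoder.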