S3TU-Net: Structured Convolution and Superpixel Transformer for Lung Nodule Segmentation

Yuke Wu, Xiang Liu, Yunyu Shi, Xinyi Chen, Zhenglei Wang, YuQing Xu, Shuo Hong Wang

Abstract

The irregular and challenging characteristics of lung adenocarcinoma nodules in computed tomography (CT) images complicate staging diagnosis, making accurate segmentation critical for clinicians to extract detailed lesion information. In this study, we propose a segmentation model, S3TU-Net, which integrates multi-dimensional spatial connectors and a superpixel-based visual transformer. S3TU-Net is built on a multi-view CNN-Transformer hybrid architecture, incorporating superpixel algorithms, structured weighting, and spatial shifting techniques to achieve superior segmentation performance. The model leverages structured convolution blocks (DWF-Conv/D2BR-Conv) to extract multi-scale local features while mitigating overfitting. To enhance multi-scale feature fusion, we introduce the S2-MLP Link, integrating spatial shifting and attention mechanisms at the skip connections. Additionally, the residual-based superpixel visual transformer (RM-SViT) effectively merges global and local features by employing sparse correlation learning and multi-branch attention to capture long-range dependencies, with residual connections enhancing stability and computational efficiency. Experimental results on the LIDC-IDRI dataset demonstrate that S3TU-Net achieves a DSC, precision, and IoU of 89.04%, 90.73%, and 90.70%, respectively. Compared to recent methods, S3TU-Net improves DSC by 4.52% and sensitivity by 3.16%, with other metrics showing an approximate 2% increase. In addition to comparison and ablation studies, we validated the generalization ability of our model on the EPDB private dataset, achieving a DSC of 86.40%.
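The reported metrics (DSC, IoU, precision, sensitivity) can be illustrated with a minimal sketch using their standard definitions on binary masks. The function name `segmentation_scores` and the toy masks are hypothetical, and the paper's exact evaluation protocol (thresholding, averaging per slice or per nodule) is not stated in this section, so this shows only the textbook formulas.

```python
def segmentation_scores(pred, target):
    """Standard overlap metrics for two binary masks (nested lists of 0/1).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")

    tp = sum(1 for a, b in zip(p, t) if a and b)        # true positives
    fp = sum(1 for a, b in zip(p, t) if a and not b)    # false positives
    fn = sum(1 for a, b in zip(p, t) if not a and b)    # false negatives

    def ratio(num, den):
        # Convention assumed here: an empty denominator counts as a perfect score.
        return num / den if den else 1.0

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),   # Dice similarity coefficient
        "iou": ratio(tp, tp + fp + fn),           # intersection over union
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
    }


# Toy 2 x 3 masks: two overlapping pixels, one false positive, one false negative.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
scores = segmentation_scores(pred_mask, true_mask)
print(scores)
```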

Paper Structure

This paper contains 21 sections, 17 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: The overall framework of S3TU-Net. The framework comprises three broad categories of modules: two novel convolutional modules (DWF-Conv/D2BR-Conv), a multi-spatial-dimensional connector (S2-MLP Link), and a residual-connection-based superpixel vision transformer (RM-SViT).
  • Figure 2: The architecture of traditional convolutional blocks. (a) The convolution module in the Squeeze-and-Excitation network. (b) The traditional convolution module in the UNet network.
  • Figure 3: The architecture of the newly proposed convolutional blocks. (a) The D2BR convolutional block, composed of $3 \times 3$ Conv, DropBlock, BN, and ReLU. (b) The DWF convolutional block, which combines $3 \times 3$ Conv, BN, an LKA module, and ReLU with different scaling weight values. The LKA module contains multiple large-kernel convolutions, depthwise convolutions, and pointwise convolutions that expand the receptive field.
  • Figure 4: The architecture of the RM-SViT module. The encoder expands the feature tensor and groups 'Tokens' into 'Super Tokens' by sparse association learning. After the corresponding rounds of iteration, it adjusts the final 'Super Tokens' by applying multi-branch self-attention with residual connections, and finally maps the expanded local blocks back to the original token space.
  • Figure 5: The architecture of the S2-MLP Link module. First, an MLP expands the channels of the feature map from $c$ to $3 \times c$, and the result is divided into three parts ($F_{1}$, $F_{2}$, $F_{3}$) along the channel dimension. $F_{1}$ and $F_{2}$ are spatially shifted in different directions, while $F_{3}$ remains unchanged. Split Attention is then used to weight the parts, and a final MLP restores the channel dimension.
  • ...and 4 more figures
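
The split-and-shift step described in the Figure 5 caption can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `shift_map`, `spatial_shift`, `split_and_shift`, `DIRS_A`, and `DIRS_B` are hypothetical, the one-pixel shift over four channel groups follows the common S2-MLP convention rather than anything stated in this section, and the MLP expansion, Split Attention, and MLP recovery stages are omitted.

```python
# Assumed shift directions (dy, dx) for the four channel groups of each part.
DIRS_A = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # used for F1 (assumption)
DIRS_B = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # used for F2 (assumption)


def shift_map(fmap, dy, dx):
    """Shift one H x W map by (dy, dx); pixels with no source keep their value."""
    h, w = len(fmap), len(fmap[0])
    out = [row[:] for row in fmap]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = fmap[sy][sx]
    return out


def spatial_shift(part, directions):
    """Shift four equal channel groups of a C x H x W tensor in four directions."""
    if len(part) % 4:
        raise ValueError("channel count must be divisible by 4")
    group = len(part) // 4
    return [shift_map(fmap, *directions[i // group])
            for i, fmap in enumerate(part)]


def split_and_shift(expanded):
    """Split a 3c-channel tensor into F1, F2, F3; shift F1 and F2, keep F3."""
    if len(expanded) % 3:
        raise ValueError("channel count must be divisible by 3")
    c = len(expanded) // 3
    f1, f2, f3 = expanded[:c], expanded[c:2 * c], expanded[2 * c:]
    return spatial_shift(f1, DIRS_A), spatial_shift(f2, DIRS_B), f3


# Toy input: 12 channels (c = 4), each a 2 x 2 map.
expanded = [[[1, 2], [3, 4]] for _ in range(12)]
f1, f2, f3 = split_and_shift(expanded)
```

In this toy setup each part sees its neighbours from a different direction per channel group, so a later channel-mixing layer can combine information from adjacent pixels without any convolution.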