SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharyngeal Tumor Detection

Jia Wei, Yun Li, Xiaomao Fan, Wenjun Ma, Meiyu Qiu, Hongyu Chen, Wenbin Lei

TL;DR

SAM-Swin is a SAM-driven Dual-Swin Transformer for laryngo-pharyngeal tumor detection. It includes a multi-scale lesion-aware enhancement module (MS-LAEM) that adaptively enhances the learning of complementary features across scales, improving the quality of feature extraction and representation.

Abstract

Laryngo-pharyngeal cancer (LPC) is a highly lethal malignancy in the head and neck region. Recent advancements in tumor detection, particularly through dual-branch network architectures, have significantly improved diagnostic accuracy by integrating global and local feature extraction. However, challenges remain in accurately localizing lesions and fully capitalizing on the complementary nature of features within these branches. To address these issues, we propose SAM-Swin, an innovative SAM-driven Dual-Swin Transformer for laryngo-pharyngeal tumor detection. This model leverages the robust segmentation capabilities of the Segment Anything Model 2 (SAM2) to achieve precise lesion segmentation. Meanwhile, we present a multi-scale lesion-aware enhancement module (MS-LAEM) designed to adaptively enhance the learning of nuanced complementary features across various scales, improving the quality of feature extraction and representation. Furthermore, we implement a multi-scale class-aware guidance (CAG) loss that delivers multi-scale targeted supervision, thereby enhancing the model's capacity to extract class-specific features. To validate our approach, we compiled three LPC datasets from the First Affiliated Hospital (FAHSYSU), the Sixth Affiliated Hospital (SAHSYSU) of Sun Yat-sen University, and Nanfang Hospital of Southern Medical University (NHSMU). The FAHSYSU dataset is utilized for internal training, while the SAHSYSU and NHSMU datasets serve for external evaluation. Extensive experiments demonstrate that SAM-Swin outperforms state-of-the-art methods, showcasing its potential for advancing LPC detection and improving patient outcomes. The source code of SAM-Swin is available at https://github.com/VVJia/SAM-Swin.

Paper Structure

This paper contains 28 sections, 13 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The overall architecture of SAM-Swin. SAM-Swin consists of four key parts: a SAM2-guided lesion location module (SAM2-GLLM), a whole image branch (WIB), a lesion region branch (LRB), and a multi-scale lesion-aware enhancement module (MS-LAEM).
  • Figure 2: Illustration of the workflow of the SAM2-Guided Lesion Location Module (SAM2-GLLM). The whole image $x_w$ is processed by SAM2, generating the corresponding lesion mask $m_w$. Points $P_1$ and $P_2$ are selected based on the foreground, defined as the region where $m_w(x,y)=255$. The lesion region image $x_l$ is then cropped from $x_w$ using the coordinates of these two points.
  • Figure 3: Illustration of the lesion-aware enhancement module (LAEM). Query tokens are generated from the lesion region tokens, while key and value tokens are produced from the whole image tokens. These query, key, and value tokens are then processed through Multi-Head Attention (MHA) to derive enhanced feature tokens, which contain richer lesion-specific feature representations. Subsequently, the enhanced feature tokens are multiplied by a learnable, zero-initialized gating factor, which adaptively adjusts the importance of the lesion features. Lastly, these enhanced feature tokens are combined with the original whole image tokens to produce the final tokens.
  • Figure 4: Confusion matrices obtained by our proposed SAM-Swin and other comparative methods on the FAHSYSU dataset. (a) VGGNet, (b) ResNet, (c) DenseNet, (d) EfficientNet, (e) ViT, (f) SwinV2, (g) RadFormer, (h) DLGNet, (i) SAM-FNet, (j) SAM-Swin.
  • Figure 5: Illustrations of the Grad-CAM visualization on the FAHSYSU, SAHSYSU, and NHSMU datasets.
  • ...and 5 more figures
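
The cropping step described in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the caption does not say how $P_1$ and $P_2$ are chosen, so the code assumes they are the top-left and bottom-right corners of the tightest bounding box around the foreground. The function name `crop_lesion_region` and the fallback to the whole image when the mask is empty are also assumptions.

```python
import numpy as np


def crop_lesion_region(x_w, m_w, foreground=255):
    """Crop the lesion region x_l from the whole image x_w using the mask m_w.

    Assumption: P1 and P2 are the top-left and bottom-right corners of the
    tightest axis-aligned box around all pixels where m_w(x, y) == foreground.
    """
    if m_w.shape != x_w.shape[:2]:
        raise ValueError("mask and image must share height and width")

    # Row (y) and column (x) indices of every foreground pixel.
    ys, xs = np.nonzero(m_w == foreground)
    if ys.size == 0:
        # Assumption: with no foreground, fall back to the whole image.
        return x_w, None

    p1 = (int(xs.min()), int(ys.min()))  # (x, y) of the top-left corner
    p2 = (int(xs.max()), int(ys.max()))  # (x, y) of the bottom-right corner

    # NumPy images are indexed [row, column], i.e. [y, x]; slices are end-exclusive.
    x_l = x_w[p1[1]:p2[1] + 1, p1[0]:p2[0] + 1]
    return x_l, (p1, p2)


# Toy example: a 6x6 "image" with a rectangular lesion mask.
image = np.arange(36).reshape(6, 6)
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 1:5] = 255
lesion, corners = crop_lesion_region(image, mask)
```

In practice the crop would typically be resized to the input resolution expected by the lesion region branch; that step is omitted here.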
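
The gated cross-attention described in the Figure 3 caption can be illustrated with a small forward-pass sketch in NumPy. It is a sketch under stated assumptions rather than the authors' code: the class name `GatedCrossAttention`, the random projection weights, and the scalar gate are hypothetical; "combined" is interpreted as a residual addition; and both branches are assumed to yield the same number of tokens so the addition is well defined. In a real model the gate and projections would be trainable parameters.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class GatedCrossAttention:
    """Forward-only sketch: queries from lesion tokens, keys/values from whole-image tokens."""

    def __init__(self, dim, num_heads, seed=0):
        if dim % num_heads != 0:
            raise ValueError("dim must be divisible by num_heads")
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        # Hypothetical random projections standing in for learned weights.
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        self.w_o = rng.normal(0.0, scale, (dim, dim))
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Zero-initialized gating factor: the module starts as an identity map.
        self.gate = 0.0

    def _split_heads(self, t):
        # (N, dim) -> (heads, N, head_dim)
        n = t.shape[0]
        return t.reshape(n, self.num_heads, self.head_dim).transpose(1, 0, 2)

    def __call__(self, whole_tokens, lesion_tokens):
        # Assumption: both branches produce the same number of tokens.
        if whole_tokens.shape != lesion_tokens.shape:
            raise ValueError("token tensors must have the same shape")
        q = self._split_heads(lesion_tokens @ self.w_q)
        k = self._split_heads(whole_tokens @ self.w_k)
        v = self._split_heads(whole_tokens @ self.w_v)
        # Scaled dot-product attention per head: (heads, N, N).
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.head_dim))
        heads = attn @ v  # (heads, N, head_dim)
        merged = heads.transpose(1, 0, 2).reshape(whole_tokens.shape)
        enhanced = merged @ self.w_o
        # Gate the enhanced tokens, then add them to the whole-image tokens
        # (assumption: "combined" means residual addition).
        return whole_tokens + self.gate * enhanced


rng = np.random.default_rng(1)
whole = rng.normal(size=(4, 8))   # 4 whole-image tokens of width 8
lesion = rng.normal(size=(4, 8))  # 4 lesion-region tokens of width 8
layer = GatedCrossAttention(dim=8, num_heads=2)
out_at_init = layer(whole, lesion)  # gate is 0, so this equals `whole`
layer.gate = 0.5                    # pretend training moved the gate
out_gated = layer(whole, lesion)
```

With the gate at zero the output equals the whole-image tokens, so the enhancement can be introduced gradually during training instead of perturbing the whole image branch from the first step.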