SAM-FNet: SAM-Guided Fusion Network for Laryngo-Pharyngeal Tumor Detection

Jia Wei, Yun Li, Meiyu Qiu, Hongyu Chen, Xiaomao Fan, Wenbin Lei

TL;DR

This paper tackles automatic laryngo-pharyngeal tumor detection from endoscopic images by introducing SAM-FNet, a dual-branch network that fuses global (whole-image) and local (lesion-region) features. A LoRA-tuned SAM in the SAM-guided lesion location (SLL) module localizes the lesion, and a GAN-like feature optimization (GFO) module encourages the global and local representations to be complementary. The model is trained on the internal FAHSYSU dataset and evaluated on both FAHSYSU and the external SAHSYSU dataset. The reported results are competitive with or better than prior methods on accuracy, precision, recall, and F1, and ablations support both the combination of global and local features and the adversarial fusion. The authors also report accurate lesion masks, generalization to the external dataset, and potential for assisting laryngologists in LPC diagnosis; the code is publicly available.

Abstract

Laryngo-pharyngeal cancer (LPC) is a highly fatal malignant disease affecting the head and neck region. Previous studies on endoscopic tumor detection, particularly those leveraging dual-branch network architectures, have shown significant advancements in tumor detection. These studies highlight the potential of dual-branch networks in improving diagnostic accuracy by effectively integrating global and local (lesion) feature extraction. However, they are still limited in their ability to accurately locate the lesion region and to capture the discriminative feature information between the global and local branches. To address these issues, we propose a novel SAM-guided fusion network (SAM-FNet), a dual-branch network for laryngo-pharyngeal tumor detection. By leveraging the powerful object segmentation capabilities of the Segment Anything Model (SAM), we integrate SAM into SAM-FNet to accurately segment the lesion region. Furthermore, we propose a GAN-like feature optimization (GFO) module to capture the discriminative features between the global and local branches, enhancing the complementarity of the fused features. Additionally, we collect two LPC datasets from the First Affiliated Hospital (FAHSYSU) and the Sixth Affiliated Hospital (SAHSYSU) of Sun Yat-sen University. The FAHSYSU dataset is used as the internal dataset for training the model, while the SAHSYSU dataset is used as the external dataset for evaluating the model's performance. Extensive experiments on both the FAHSYSU and SAHSYSU datasets demonstrate that SAM-FNet achieves competitive results, outperforming state-of-the-art counterparts. The source code of SAM-FNet is available at https://github.com/VVJia/SAM-FNet.

Paper Structure (20 sections, 10 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The architecture of the proposed SAM-FNet includes several key components: a SAM-guided lesion location (SLL) module to generate the lesion area image from the entire image; a global feature extractor (GFE) to extract features from the whole image; a local feature extractor (LFE) to derive features from the lesion region; a GAN-like feature optimization (GFO) module to align global and local features while differentiating their distributions; and a classifier that predicts based on global, local, and fused features.
  • Figure 2: Receiver Operating Characteristic (ROC) curves for the experimental results on the FAHSYSU dataset: (a) Normal, (b) Benign, (c) Malignant. The proposed SAM-FNet, shown as the red line, achieves the best classification performance across all classes.
  • Figure 3: Grad-CAM visualizations of tumor images in both NBI and WLI modalities. Compared with other state-of-the-art methods, SAM-FNet focuses on and highlights tumor characteristics more precisely.
  • Figure 4: Predicted lesion masks generated by the LoRA-based SAM within the SLL module in both NBI and WLI modalities. The predicted masks closely match the ground truth masks across these imaging modalities.
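
The data flow described in the Figure 1 caption (lesion mask, lesion crop, global and local features, three predictions) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the bounding-box crop, the concatenation used for fusion, the random stand-in feature extractors and classifier heads, and all function names are assumptions made for illustration, and the GFO module is omitted entirely. The three output classes follow the Normal/Benign/Malignant split shown in Figure 2.

```python
import numpy as np


def lesion_bbox(mask, pad=0):
    """Bounding box (top, bottom, left, right) of a binary mask, or None if it is empty."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    height, width = mask.shape
    top = max(int(rows[0]) - pad, 0)
    bottom = min(int(rows[-1]) + 1 + pad, height)
    left = max(int(cols[0]) - pad, 0)
    right = min(int(cols[-1]) + 1 + pad, width)
    return top, bottom, left, right


def crop_lesion(image, mask, pad=0):
    """Crop the lesion region; fall back to the whole image if the mask is empty (assumption)."""
    box = lesion_bbox(mask, pad)
    if box is None:
        return image
    top, bottom, left, right = box
    return image[top:bottom, left:right]


def toy_features(image, dim, seed):
    """Hypothetical stand-in for a feature extractor: channel means times a fixed random projection."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((image.shape[-1], dim))
    channel_means = image.reshape(-1, image.shape[-1]).mean(axis=0)
    return channel_means @ projection


def softmax(logits):
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()


def forward(image, mask, num_classes=3, dim=8):
    """Return class probabilities from the global, local, and fused features."""
    global_feat = toy_features(image, dim, seed=0)                    # stand-in for GFE
    local_feat = toy_features(crop_lesion(image, mask), dim, seed=1)  # stand-in for LFE
    fused_feat = np.concatenate([global_feat, local_feat])            # assumed fusion: concatenation
    rng = np.random.default_rng(2)
    feats = {"global": global_feat, "local": local_feat, "fused": fused_feat}
    # Untrained random linear heads, one per feature type.
    return {
        name: softmax(feat @ rng.standard_normal((feat.size, num_classes)))
        for name, feat in feats.items()
    }


if __name__ == "__main__":
    image = np.random.default_rng(42).random((64, 64, 3))
    mask = np.zeros((64, 64), dtype=bool)
    mask[20:40, 10:30] = True  # pretend this came from the SLL module
    for name, probs in forward(image, mask).items():
        print(name, probs.round(3))
```

In a real system the mask would come from the LoRA-tuned SAM and the extractors and heads would be trained networks; the sketch only shows how the three predictions relate to the two input views.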