CLIP-TNseg: A Multi-Modal Hybrid Framework for Thyroid Nodule Segmentation in Ultrasound Images

Xinjie Sun, Boxiong Wei, Yalong Jiang, Liquan Mao, Qi Zhao

TL;DR

This work tackles the challenging task of thyroid nodule segmentation in ultrasound images, where speckle noise and artifacts hinder accuracy. It introduces CLIP-TNseg, a two-branch architecture that fuses a Coarse-grained Branch leveraging a frozen CLIP backbone with FiLM-based cross-modal fusion and a Fine-grained Branch built on U-Net residual blocks, with a Prediction Head to output pixel-level maps. The authors construct a large PKTN dataset and validate on public data and TN3K, reporting state-of-the-art IoU and Dice scores (IoU up to 86.85% and Dice up to 91.91% on the comprehensive dataset) and strong generalization to TN3K. Ablation studies confirm that both branches contribute meaningfully, underscoring the benefit of integrating multimodal semantic guidance with detailed spatial refinement. The approach demonstrates the potential of multimodal large models for robust, interpretable medical image segmentation and could extend to other ultrasound-based diagnostic tasks.
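The IoU and Dice scores quoted above are overlap measures between a predicted mask and the ground-truth mask. As a minimal sketch (not the authors' evaluation code), the function below computes both for binary masks stored as nested lists of 0/1. The function name `iou_and_dice` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # Count pixels that are foreground in both masks, and in each mask alone.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


# Toy 2x3 masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Scores averaged over many images need not satisfy that identity exactly.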

Abstract

Thyroid nodule segmentation in ultrasound images is crucial for accurate diagnosis and treatment planning. However, existing methods face challenges in segmentation accuracy, interpretability, and generalization, which hinder their performance. This letter proposes a novel framework, CLIP-TNseg, to address these issues by integrating a multimodal large model with a neural network architecture. CLIP-TNseg consists of two main branches: the Coarse-grained Branch, which extracts high-level semantic features from a frozen CLIP model, and the Fine-grained Branch, which captures fine-grained features using U-Net style residual blocks. These features are fused and processed by the prediction head to generate precise segmentation maps. CLIP-TNseg leverages the Coarse-grained Branch to enhance semantic understanding through textual and high-level visual features, while the Fine-grained Branch refines spatial details, enabling precise and robust segmentation. Extensive experiments on public and our newly collected datasets demonstrate its competitive performance. Our code and the original dataset are available at https://github.com/jayxjsun/CLIP-TNseg.
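The TL;DR describes the Coarse-grained Branch as using FiLM-based cross-modal fusion. In general, FiLM (feature-wise linear modulation) scales and shifts each feature channel with a per-channel `gamma` and `beta` predicted from a conditioning vector, here presumably a text embedding. The sketch below illustrates that general idea in plain Python. It is not the authors' implementation, and the helper names, shapes, and weight values are all hypothetical.

```python
def linear(vec, weights, bias):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate `features` (a list of channels) with a conditioning vector.

    Each channel c is transformed as gamma[c] * x + beta[c], where gamma
    and beta are linear projections of `cond`.
    """
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Two visual channels with two spatial positions each (hypothetical values).
features = [[1.0, 2.0],
            [3.0, 4.0]]
# A 2-dimensional stand-in for a text embedding.
cond = [1.0, -1.0]

# Hypothetical projection weights; in practice these would be learned.
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_beta, b_beta = [[0.5, 0.5], [1.0, 0.0]], [0.0, 1.0]

modulated = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[1.0, 2.0], [-1.0, -2.0]]
```

With these toy weights the first channel passes through unchanged (gamma 1, beta 0) while the second is negated and shifted (gamma -1, beta 2), which shows how the conditioning input can emphasize or suppress individual channels.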

Paper Structure

This paper contains 12 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overall architecture of CLIP-TNseg, with three main components: the Coarse-grained Branch (CGB), which extracts high-level semantic features from a pre-trained CLIP model; the Fine-grained Branch (FGB), which captures fine-grained features with residual learning; and the Prediction Head (PH), which generates the final segmentation maps.
  • Figure 2: Visual comparison of thyroid nodule segmentation results from different methods. From left to right: Input Image, Ground Truth, Ours (CLIP-TNseg), LViT, TGANet, CLIPSeg, Attention U-Net, U-Net, and FCN.