DeNAS-ViT: Data Efficient NAS-Optimized Vision Transformer for Ultrasound Image Segmentation

Renqi Chen, Xinzhe Zheng, Haoyang Su, Kehan Wu

TL;DR

This paper tackles ultrasound image segmentation under limited labeled data by introducing DeNAS-ViT, a data-efficient NAS-optimized Vision Transformer. It combines token-level NAS to enhance multi-scale feature learning with a NAS-guided constraint-driven SSL framework and a stage-wise optimization strategy to mitigate overfitting on small datasets. The method achieves state-of-the-art results on CAMUS, CETUS, and HMC-QU while requiring only modest NAS search cost, and it demonstrates promising generalization to other medical imaging domains. Overall, DeNAS-ViT offers a data-efficient, scalable approach to ultrasound segmentation with potential applicability beyond ultrasound imaging.

Abstract

Accurate segmentation of ultrasound images is essential for reliable medical diagnoses but is challenged by poor image quality and scarce labeled data. Prior approaches have relied on manually designed, complex network architectures to improve multi-scale feature extraction. However, such handcrafted models offer limited gains when prior knowledge is inadequate and are prone to overfitting on small datasets. In this paper, we introduce DeNAS-ViT, a data-efficient NAS-optimized Vision Transformer, the first method to leverage neural architecture search (NAS) for ultrasound image segmentation by automatically optimizing model architecture through token-level search. Specifically, we propose an efficient NAS module that performs multi-scale token search prior to the ViT's attention mechanism, effectively capturing both contextual and local features while minimizing computational costs. Given ultrasound's data scarcity and NAS's inherent data demands, we further develop a NAS-guided semi-supervised learning (SSL) framework. This approach integrates network independence and contrastive learning within a stage-wise optimization strategy, significantly enhancing model robustness under limited-data conditions. Extensive experiments on public datasets demonstrate that DeNAS-ViT achieves state-of-the-art performance, maintaining robustness with minimal labeled data. Moreover, we highlight DeNAS-ViT's generalization potential beyond ultrasound imaging, underscoring its broader applicability.
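
To make the idea of a token-level search over multiple scales more concrete, the following is a minimal sketch of a differentiable (DARTS-style) mixed operation applied to tokens before attention. It is an illustration under stated assumptions, not the authors' implementation: the candidate operations (average pooling of tokens over windows of 1, 2, and 4), the softmax relaxation, and the names `pool_tokens`, `mixed_token_op`, and `derive_scale` are all hypothetical, since this summary does not specify the actual search space.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pool_tokens(tokens, window):
    """Average tokens over non-overlapping windows, then repeat each
    pooled token so the sequence length stays unchanged."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        mean = [sum(column) / len(group) for column in zip(*group)]
        pooled.extend([list(mean) for _ in group])
    return pooled


def mixed_token_op(tokens, alpha, windows=(1, 2, 4)):
    """Softmax-weighted sum of the candidate multi-scale token operations.

    `alpha` holds one architecture parameter per candidate window size
    (assumed search space, not taken from the paper).
    """
    weights = softmax(alpha)
    candidates = [pool_tokens(tokens, w) for w in windows]
    mixed = []
    for i, token in enumerate(tokens):
        mixed.append([
            sum(wt * cand[i][d] for wt, cand in zip(weights, candidates))
            for d in range(len(token))
        ])
    return mixed


def derive_scale(alpha, windows=(1, 2, 4)):
    """After the search, keep only the candidate with the largest alpha."""
    best = max(range(len(alpha)), key=lambda k: alpha[k])
    return windows[best]


# Four one-dimensional tokens; equal alphas give each scale weight 1/3.
tokens = [[0.0], [2.0], [4.0], [6.0]]
mixed = mixed_token_op(tokens, alpha=[0.0, 0.0, 0.0])
chosen = derive_scale([0.1, 2.0, -1.0])
```

In a real search, the architecture parameters and the network weights would be optimized by gradient descent; the sketch shows only the forward mixing and the final discrete selection.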

Paper Structure

This paper contains 35 sections, 11 equations, 16 figures, 11 tables, and 1 algorithm.

Figures (16)

  • Figure 1: An illustration comparing DeNAS-ViT with existing baselines on the segmentation task. Smiley, neutral, and sad faces indicate performance. "Design" denotes the optimization of the model architecture for multi-scale feature extraction, "Data" denotes robustness to limited data, and "Cost" denotes computational resource consumption. $\alpha$ and $w$ denote the architecture and the weights, respectively.
  • Figure 2: The pipeline of DeNAS-ViT. The hierarchical structure of the NAS networks enhances multi-scale feature extraction, and the multi-constraint SSL improves robustness to limited data.
  • Figure 3: The proposed NAS backbone. (a) shows an overview of the NAS backbone, which consists of an encoder NAS and a decoder NAS, representing a module-level search. The input is passed through the encoder NAS to obtain multi-resolution feature maps, where a hierarchical encoder cell search is performed (shown in (c)). These feature maps are then processed by the decoder cells (shown in (b)), which concatenate features and complete the recovery process through further searching.
  • Figure 4: Search cost comparison with SOTA methods.
  • Figure 5: Impact of the annotation proportion on SOTA methods, evaluated on the CAMUS dataset. DeNAS-ViT exhibits both robustness and effectiveness.
  • ...and 11 more figures