A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation

Prateek Singh, Moumita Dholey, P. K. Vinod

TL;DR

The paper addresses the problem of accurate lesion segmentation in breast ultrasound images under challenging noise and boundary conditions. It proposes a ViT-conditioned diffusion framework with dual modes: a full high-fidelity sampling mode for large datasets and a lightweight auxiliary-inference mode for small datasets, where the diffusion objective acts as a strong regularizer guiding the backbone. Key innovations include an Adaptive Conditioning Bridge that fuses multi-scale ViT features into a conditional UNet and a Topological Denoising Consistency loss that penalizes topological changes across denoising steps via the 1-Wasserstein distance between Persistence Diagrams. Empirical results show state-of-the-art Dice scores on BUSI (0.96), BrEaST (0.90), and BUS-UCLM (0.97), and robust cross-modality generalization to REFUGE2, ISIC2018, and BraTS, with ablation studies confirming the contributions. The work offers a practical, computationally efficient path toward anatomically plausible, clinically applicable segmentation across imaging modalities.
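To make the distance underlying the Topological Denoising Consistency loss concrete, here is a minimal sketch of the 1-Wasserstein distance between two small persistence diagrams. It is an illustration under stated assumptions, not the authors' implementation: the function name `wasserstein1`, the L-infinity ground metric, and the brute-force matching are choices made for this example, and the diagrams are assumed to be already computed as lists of (birth, death) pairs. How the paper extracts diagrams from intermediate denoising predictions is not shown here.

```python
from itertools import permutations


def _linf(p, q):
    # L-infinity distance between two (birth, death) points (assumed ground metric).
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def _to_diagonal(p):
    # L-infinity distance from a (birth, death) point to the diagonal birth == death.
    return abs(p[1] - p[0]) / 2.0


def wasserstein1(diag_a, diag_b):
    """1-Wasserstein distance between two persistence diagrams.

    Each diagram is a list of (birth, death) pairs. Every point may be
    matched either to a point of the other diagram or to the diagonal.
    Brute force over all matchings, so only suitable for tiny diagrams.
    """
    n, m = len(diag_a), len(diag_b)
    size = n + m  # n points of A plus m diagonal slots, and vice versa for B

    def cost(i, j):
        if i < n and j < m:
            return _linf(diag_a[i], diag_b[j])   # point matched to point
        if i < n:
            return _to_diagonal(diag_a[i])       # point of A matched to diagonal
        if j < m:
            return _to_diagonal(diag_b[j])       # point of B matched to diagonal
        return 0.0                               # diagonal matched to diagonal

    return min(
        sum(cost(i, j) for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )


# Hypothetical diagrams for predictions at two consecutive denoising steps:
# one persistent feature in both, plus a short-lived spurious feature in the first.
step_t = [(0.0, 1.0), (0.4, 0.5)]
step_t_minus_1 = [(0.0, 1.0)]
tdc_penalty = wasserstein1(step_t, step_t_minus_1)  # cost of sending (0.4, 0.5) to the diagonal
```

In this toy case the shared feature matches at zero cost and only the short-lived feature contributes, which is the behaviour a topological consistency penalty relies on: small when the topology is stable between steps, larger when components or holes appear or vanish. For realistic diagram sizes, an assignment solver would replace the brute-force search.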

Abstract

In breast ultrasound images, precise lesion segmentation is essential for early diagnosis; however, low contrast, speckle noise, and unclear boundaries make this difficult. Although deep learning models have demonstrated potential, standard convolutional architectures frequently fail to capture enough global context, resulting in anatomically inconsistent segmentations. To overcome these drawbacks, we propose a flexible, conditional Denoising Diffusion Model that combines an enhanced UNet-based generative decoder with a Vision Transformer (ViT) encoder for global feature extraction. We introduce three primary innovations: 1) an Adaptive Conditioning Bridge (ACB) for efficient, multi-scale fusion of semantic features; 2) a novel Topological Denoising Consistency (TDC) loss component that regularizes training by penalizing structural inconsistencies during denoising; and 3) a dual-head architecture, comprising a noise prediction head and a lightweight auxiliary head, that leverages the denoising objective as a powerful regularizer, enabling the auxiliary head to perform rapid and accurate inference on smaller datasets. Our framework establishes a new state of the art on public breast ultrasound datasets, achieving Dice scores of 0.96 on BUSI, 0.90 on BrEaST, and 0.97 on BUS-UCLM. Comprehensive ablation studies empirically validate that the model components are critical for achieving these results and for producing segmentations that are not only accurate but also anatomically plausible.
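For readers unfamiliar with the reported metric, the following is a minimal sketch of the Dice score for binary masks, using its standard definition, 2|P ∩ G| / (|P| + |G|). The function name `dice_score`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for this example; the paper's exact evaluation protocol (thresholding, per-image versus dataset-level averaging) is not specified here.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    g = [v for row in target for v in row]
    if len(p) != len(g):
        raise ValueError("masks must have the same number of pixels")

    # Pixels that are foreground in both the prediction and the ground truth.
    intersection = sum(1 for a, b in zip(p, g) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in g if b)

    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy 2x2 example: prediction has two foreground pixels, ground truth has one,
# and they overlap in one pixel, so Dice = 2 * 1 / (2 + 1), roughly 0.667.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
score = dice_score(pred_mask, gt_mask)
```

A score of 0.96, as reported for BUSI, therefore means the overlap between predicted and ground-truth lesion pixels is very close to the combined size of the two masks.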

Paper Structure

This paper contains 15 sections, 9 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Architecture Overview
  • Figure 2: Predictions on BUSI with the Hybrid loss vs. the Enhanced loss with the TDC component (red: prediction, green: ground truth).