DB-SAM: Delving into High Quality Universal Medical Image Segmentation

Chao Qin, Jiale Cao, Huazhu Fu, Fahad Shahbaz Khan, Rao Muhammad Anwer

TL;DR

This work proposes DB-SAM, a dual-branch adapted SAM framework that aims to bridge the domain gap between natural and 2D/3D medical data. On 21 3D medical image segmentation tasks, it achieves an absolute gain of 8.8% over a recent medical SAM adapter in the literature.

Abstract

Recently, the Segment Anything Model (SAM) has demonstrated promising segmentation capabilities in a variety of downstream segmentation tasks. However, in the context of universal medical image segmentation, there is a notable performance discrepancy when SAM is applied directly, owing to the domain gap between natural and 2D/3D medical data. In this work, we propose a dual-branch adapted SAM framework, named DB-SAM, that strives to effectively bridge this domain gap. Our dual-branch adapted SAM contains two branches in parallel: a ViT branch and a convolution branch. The ViT branch incorporates a learnable channel attention block after each frozen attention block, which captures domain-specific local features. The convolution branch, on the other hand, employs a light-weight convolutional block to extract domain-specific shallow features from the input medical image. To perform cross-branch feature fusion, we design a bilateral cross-attention block and a ViT convolution fusion block, which dynamically combine the diverse information of the two branches for the mask decoder. Extensive experiments on a large-scale medical image dataset with various 3D and 2D medical segmentation tasks reveal the merits of our proposed contributions. On 21 3D medical image segmentation tasks, our proposed DB-SAM achieves an absolute gain of 8.8% compared to a recent medical SAM adapter in the literature. The code and model are available at https://github.com/AlfredQin/DB-SAM.
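
The abstract does not spell out the internals of the learnable channel attention block, so the following is only a minimal sketch of the idea, assuming a squeeze-and-excitation style gate applied to the output of a frozen block. The names channel_attention, adapted_block, frozen_block, w_down and w_up are hypothetical and are not taken from the DB-SAM code; NumPy stands in for a deep learning framework.

```python
import numpy as np


def channel_attention(x, w_down, w_up):
    """Squeeze-and-excitation style channel gate (an assumed form).

    x:      (tokens, channels) features from a frozen attention block.
    w_down: (channels, hidden) learnable reduction weights.
    w_up:   (hidden, channels) learnable expansion weights.
    """
    squeezed = x.mean(axis=0)                       # (channels,) average over tokens
    hidden = np.maximum(squeezed @ w_down, 0.0)     # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_up)))   # sigmoid, one weight per channel
    return x * gate                                 # rescale every token channel-wise


def adapted_block(x, frozen_block, w_down, w_up):
    """Run a frozen block, then the learnable channel attention after it."""
    y = frozen_block(x)  # pre-trained weights stay fixed; only w_down / w_up would train
    return channel_attention(y, w_down, w_up)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))               # 16 tokens, 8 channels
    out = adapted_block(
        tokens,
        frozen_block=lambda t: t,                   # identity stands in for a ViT block
        w_down=rng.normal(size=(8, 2)),
        w_up=rng.normal(size=(2, 8)),
    )
    print(out.shape)
```

In this sketch only the two small weight matrices would be updated during adaptation, which is consistent with the abstract's description of a learnable block placed after each frozen one. The actual block in DB-SAM may differ in structure.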

Paper Structure

This paper contains 11 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: (a) Overall architecture of our DB-SAM, which contains two branches: a ViT branch and a convolution branch. The ViT branch incorporates a channel attention block (b) to capture domain-specific high-level features, while the convolution branch adopts light-weight convolution blocks to extract shallow features. For cross-branch fusion, we introduce a bilateral cross-attention operation (c) and a ViT-Conv fusion module (d) to adaptively combine the features. Finally, the fused features and prompt embeddings are fed to the mask decoder.
  • Figure 2: Visualization examples of the pre-trained SAM, MedSAM, our model, and the ground truth (GT) on different 3D and 2D tasks. Our DB-SAM model achieves more accurate segmentation than SAM and MedSAM, especially in scenarios involving small organs and organs with complex shapes. Best viewed zoomed in. Additional results are presented in the supplementary material.
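
The bilateral cross-attention operation named in Figure 1 (c) can be illustrated with a small sketch in which each branch queries the other. This is an assumption-laden simplification: it uses a single head, no learned query/key/value projections, and a residual connection, none of which are confirmed by the text above. The function names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head attention where `queries` attend to `context`.

    queries: (n_q, d), context: (n_c, d). Keys and values are both `context`;
    learned projections are omitted to keep the sketch short.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (n_q, n_c)
    return weights @ context                             # (n_q, d)


def bilateral_cross_attention(vit_feats, conv_feats):
    """Exchange information in both directions between the two branches."""
    vit_out = vit_feats + cross_attention(vit_feats, conv_feats)    # ViT queries conv
    conv_out = conv_feats + cross_attention(conv_feats, vit_feats)  # conv queries ViT
    return vit_out, conv_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vit = rng.normal(size=(6, 4))    # 6 ViT tokens, 4 channels
    conv = rng.normal(size=(5, 4))   # 5 conv tokens, 4 channels
    v, c = bilateral_cross_attention(vit, conv)
    print(v.shape, c.shape)
```

Each output keeps the token count of its own branch, so the two updated feature sets can then be handed to a fusion step such as the ViT-Conv fusion module (d). Both branches must share the same channel dimension for the dot products to be defined.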