ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors and Mamba Blocks for 3D EM Image Segmentation

Ruohua Shi; Qiufan Pang; Lei Ma; Lingyu Duan; Tiejun Huang; Tingting Jiang

TL;DR

ShapeMamba-EM tackles the domain gap in EM segmentation by fine-tuning a 3D medical foundation model (SAM-Med3D). It uses FacT for efficient encoder adaptation, and adds a 3D Mamba Adapter for long-range dependencies and a 3D Local Shape Descriptor (LSD) Encoder to capture local morphology. The LSD is a 10-dimensional per-voxel embedding defined by $lsd^{y}(v) = \left( s(S_v),\ m(S_v) - v,\ c(S_v) \right)$ with $S_v = \{ v' \in \Omega \mid y(v) = y(v'),\ \|v - v'\|_2^2 \le \sigma \}$, where $s(S_v) = |S_v|$, $m(S_v) = \frac{1}{|S_v|} \sum_{v' \in S_v} v'$, and $c(S_v) = \frac{1}{|S_v|} \sum_{v' \in S_v} (v' - m(S_v))(v' - m(S_v))^T$. The ten dimensions are one size value, three mean-offset components, and the six unique entries of the symmetric $3 \times 3$ covariance matrix. A 3D U-Net learns $lsd^{x}: \Omega \to \mathbb{R}^{10}$ from raw data and provides it as input to the mask decoder, while the prompt encoder is discarded and the decoder is fully fine-tuned. Evaluated on 10 EM datasets across five tasks, ShapeMamba-EM outperforms state-of-the-art baselines and several SAM fine-tuning strategies, demonstrating improved segmentation accuracy and efficiency for high-resolution neural tissue analysis. The results establish ShapeMamba-EM as a versatile and scalable approach for leveraging medical foundation models in 3D EM segmentation, with potential to advance connectomics and neuroscience research.
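To make the LSD definition concrete, the following is a minimal brute-force sketch that computes the label-derived descriptor $lsd^{y}$ for a small 3D label volume with NumPy. It is not the authors' implementation: the function name `local_shape_descriptors`, the channel order (size, mean offset, upper-triangular covariance), and the unit voxel spacing are assumptions made for illustration. In the paper the descriptors are predicted from raw data by a 3D U-Net; this sketch only shows how target values could be derived from a segmentation.

```python
import numpy as np


def local_shape_descriptors(labels: np.ndarray, sigma: float) -> np.ndarray:
    """Brute-force LSDs for a small 3D label volume.

    Hypothetical helper for illustration; returns an array of shape
    (10, D, H, W). Assumes unit voxel spacing and the channel order
    [size, mean offset (3), upper-triangular covariance (6)].
    """
    assert labels.ndim == 3
    # Voxel coordinates in the same C-order as labels.reshape(-1).
    grid = np.meshgrid(*[np.arange(n) for n in labels.shape], indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 3).astype(float)
    flat = labels.reshape(-1)
    upper = np.triu_indices(3)  # (xx, xy, xz, yy, yz, zz)

    out = np.zeros((coords.shape[0], 10))
    for i, v in enumerate(coords):
        same_label = flat == flat[i]
        # The definition bounds the squared Euclidean distance by sigma.
        nearby = np.sum((coords - v) ** 2, axis=1) <= sigma
        s_v = coords[same_label & nearby]  # always contains v itself
        mean = s_v.mean(axis=0)
        centered = s_v - mean
        cov = centered.T @ centered / len(s_v)
        out[i, 0] = len(s_v)      # s(S_v): size
        out[i, 1:4] = mean - v    # m(S_v) - v: mean offset
        out[i, 4:] = cov[upper]   # c(S_v): unique covariance entries
    return out.T.reshape(10, *labels.shape)


if __name__ == "__main__":
    demo = np.zeros((3, 3, 3), dtype=int)
    lsd = local_shape_descriptors(demo, sigma=1.0)
    print(lsd.shape)        # (10, 3, 3, 3)
    print(lsd[:, 1, 1, 1])  # descriptor of the centre voxel
```

The loop costs quadratic time in the number of voxels, so it is only suitable for checking the definition on tiny volumes; a practical pipeline would typically compute the same statistics with local filtering instead.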

Abstract

Electron microscopy (EM) imaging offers unparalleled resolution for analyzing neural tissues, crucial for uncovering the intricacies of synaptic connections and neural processes fundamental to understanding behavioral mechanisms. Recently, foundation models have demonstrated impressive performance across numerous natural and medical image segmentation tasks. However, applying these foundation models to EM segmentation faces significant challenges due to domain disparities. This paper presents ShapeMamba-EM, a specialized fine-tuning method for 3D EM segmentation, which employs adapters for long-range dependency modeling and an encoder for local shape description within the original foundation model. This approach effectively addresses the unique volumetric and morphological complexities of EM data. Tested over a wide range of EM images, covering five segmentation tasks and 10 datasets, ShapeMamba-EM outperforms existing methods, establishing a new standard in EM image segmentation and enhancing the understanding of neural tissue architecture.

Paper Structure (13 sections, 3 equations, 4 figures, 2 tables)

Introduction
Method
Overview
SAM-Med3D
Parameter-efficient fine-tuning of 3D image encoder
3D Mamba Adapter
3D Local Shape Descriptor Encoder
Experiments and Results
Datasets and Experimental Settings
Quantitative and qualitative segmentation results
Discussion and Conclusion
Acknowledgments.
Disclosure of Interests.

Figures (4)

Figure 1: Illustration of the medical data and EM data. (a) MRI edema image from the BraTS2021 dataset. (b) EM mitochondria data from MitoEM-R dataset. (c) 3D and 2D segmentation results of (b). The boundaries of instances share a similar local shape, and the scope of the instance spans the entire volume.
Figure 2: The overall architecture of ShapeMamba-EM. The image encoder is updated with FacT. The volumetric or temporal information is effectively incorporated via a set of 3D Mamba adapters. The mask decoder is fully fine-tuned and modified to recover the prediction resolution. The LSDs are trained by the 3D U-Net network.
Figure 3: Visualizations of LSDs for different segmentation tasks. From left to right: the EM image, segmentation labels, and components of the LSDs.
Figure 4: The visualization of segmentation results.
