E2ENet: Dynamic Sparse Feature Fusion for Accurate and Efficient 3D Medical Image Segmentation

Boqian Wu, Qiao Xiao, Shiwei Liu, Lu Yin, Mykola Pechenizkiy, Decebal Constantin Mocanu, Maurice Van Keulen, Elena Mocanu

TL;DR

E2ENet addresses the need for accurate yet efficient 3D medical image segmentation by introducing Dynamic Sparse Feature Fusion (DSFF) and a Restricted Depth-Shift 3D convolution. DSFF learns sparse, data-adaptive cross-scale feature connections, while depth-shift enables 3D contextual reasoning with near-2D parameter costs. Across AMOS-CT, BraTS, and BTCV, E2ENet achieves competitive Dice scores with substantial reductions in parameters and FLOPs, and ablative analyses confirm the value of DSFF and depth-shift. The method demonstrates good generalizability under domain shifts (CT to MRI) and maintains effectiveness as model capacity is adjusted. Together, these components offer a plug-and-play path toward practical, resource-efficient 3D medical segmentation on limited hardware.

Abstract

Deep neural networks have become the leading approach in 3D medical image segmentation due to their outstanding performance. However, their ever-increasing model size and computational cost have become the primary barrier to deploying them on real-world, resource-limited hardware. To improve both performance and efficiency, we propose a 3D medical image segmentation model, named Efficient to Efficient Network (E2ENet), incorporating two parametrically and computationally efficient designs. i. Dynamic sparse feature fusion (DSFF) mechanism: it adaptively learns to fuse informative multi-scale features while reducing redundancy. ii. Restricted depth-shift in 3D convolution: it leverages 3D spatial information while keeping the model and computational complexity comparable to that of 2D-based methods. We conduct extensive experiments on BTCV, AMOS-CT, and the Brain Tumor Segmentation Challenge, demonstrating that E2ENet consistently achieves a better trade-off between accuracy and efficiency than prior art across various resource constraints. E2ENet achieves comparable accuracy on the large-scale challenge AMOS-CT while saving over 68% of the parameter count and 29% of the FLOPs in the inference phase, compared with the previous best-performing method. Our code is available at: https://github.com/boqian333/E2ENet-Medical.
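
The prune-and-regrow step behind DSFF, as the Figure 3 caption describes it, can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the authors' implementation: the names `evolve_connections`, `kernels`, `mask`, and `fraction` are hypothetical, kernels are flattened to lists of weights, and leaving regrown connections at zero is an assumption made here.

```python
import random


def l1_norm(kernel):
    """L1 norm of a kernel given as a flat list of weights."""
    return sum(abs(w) for w in kernel)


def evolve_connections(kernels, mask, fraction, rng):
    """One prune-and-regrow step over candidate feature connections.

    kernels:  list of flat weight lists, one per candidate connection
              (modified in place: pruned kernels are zeroed).
    mask:     list of bools, True where the connection is active.
    fraction: share of active connections to prune and regrow.
    Returns the new mask; the number of active connections is unchanged.
    """
    active = [i for i, on in enumerate(mask) if on]
    inactive = [i for i, on in enumerate(mask) if not on]
    n_swap = min(int(fraction * len(active)), len(inactive))

    # Prune: deactivate the active kernels with the smallest L1 norms.
    pruned = sorted(active, key=lambda i: l1_norm(kernels[i]))[:n_swap]
    # Regrow: reactivate the same number of previously inactive connections,
    # chosen uniformly at random.
    regrown = rng.sample(inactive, n_swap)

    new_mask = list(mask)
    for i in pruned:
        new_mask[i] = False
        kernels[i] = [0.0] * len(kernels[i])
    for i in regrown:
        # Assumption: a regrown kernel starts from zero and is trained onward.
        new_mask[i] = True
    return new_mask


if __name__ == "__main__":
    rng = random.Random(0)
    kernels = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.0], [0.5, 0.5], [0.0, 0.0]]
    mask = [True, True, False, True, False]
    new_mask = evolve_connections(kernels, mask, fraction=0.5, rng=rng)
    print(new_mask, sum(new_mask) == sum(mask))
```

Because the number of pruned and regrown connections is the same, the sparsity level stays fixed across evolution steps, which matches the constant feature sparsity $S$ in the caption.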

Paper Structure (38 sections, 7 equations, 15 figures, 11 tables)

Figures (15)

  • Figure 1: A comparison of feature fusion schemes. The purple nodes depict features extracted from the backbone, while the green nodes depict the fused features. In particular, in DiNTS (d), red lines indicate information flow paths determined through neural architecture search techniques. In E2ENet (e), the red lines with different widths represent sparse information flows determined by the DSFF mechanism, allowing for efficient feature fusion. E2ENet dynamically learns how many of the backbone features to fuse.
  • Figure 2: The overall architecture of the proposed E2ENet consists of a CNN backbone that extracts multiple levels of features. These features are then gradually aggregated through several stages, during which the multi-scale features are fused using a fusion operation.
  • Figure 3: Illustration of our Dynamic Sparse Feature Fusion (DSFF) mechanism. The fusion operation starts from sparse feature connections and allows the connectivity to evolve after training for $\Delta T$ epochs. During each evolution stage, a fraction of the kernels with the smallest $L_1$ norms is zeroed out (red dotted line), while the same fraction of inactive connections is reactivated at random, keeping the feature sparsity $S$ constant during training (blue solid line).
  • Figure 4: Illustration of restricted depth-shift in the 3D convolution of our E2ENet. The input features (left) are first split into 3 parts along the channel dimension and then shifted by $\{ -1, 0, 1\}$ units along the depth dimension, respectively (middle). After that, 3D convolutions with kernel size 1$\times$3$\times$3 are applied to the shifted feature maps (middle) to generate the output features (right).
  • Figure 5: Qualitative comparison of E2ENet and nnUNet on the AMOS-CT challenge.
  • ...and 10 more figures
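
The shift step described in the Figure 4 caption can be illustrated with a small plain-Python sketch. It is not the authors' code: the function name `depth_shift`, the `[channel][depth][row][col]` nested-list layout, the zero fill at the depth boundaries, and the sign convention (a shift of +1 moves slices toward larger depth indices) are all assumptions made for illustration. The 1×3×3 convolution that follows the shift is omitted.

```python
def depth_shift(features, shifts=(-1, 0, 1)):
    """Shift channel groups along the depth axis.

    features: nested lists indexed as [channel][depth][row][col].
    shifts:   one depth offset per channel group; channels are split into
              len(shifts) contiguous groups.
    Slices shifted past the volume boundary are dropped, and the vacated
    depth positions are filled with zeros (an assumption of this sketch).
    """
    num_channels = len(features)
    depth = len(features[0])
    height = len(features[0][0])
    width = len(features[0][0][0])
    # Ceiling division so that every channel falls into one of the groups.
    group_size = -(-num_channels // len(shifts))

    shifted = []
    for c, volume in enumerate(features):
        shift = shifts[c // group_size]
        new_volume = []
        for d in range(depth):
            src = d - shift  # source slice that lands at depth d
            if 0 <= src < depth:
                new_volume.append([row[:] for row in volume[src]])
            else:
                new_volume.append([[0.0] * width for _ in range(height)])
        shifted.append(new_volume)
    return shifted


if __name__ == "__main__":
    # 3 channels, depth 3, 1x1 slices: channel c holds 3c+1, 3c+2, 3c+3.
    feats = [[[[float(3 * c + d + 1)]] for d in range(3)] for c in range(3)]
    out = depth_shift(feats)
    for channel in out:
        print([slice_[0][0] for slice_ in channel])
    # [2.0, 3.0, 0.0]   shifted by -1
    # [4.0, 5.0, 6.0]   unshifted
    # [0.0, 7.0, 8.0]   shifted by +1
```

After the shift, each depth slice already mixes information from its neighbouring slices, so a kernel of size 1×3×3 (9 weights per input-output channel pair, versus 27 for a 3×3×3 kernel) can still see context along the depth axis.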