
Flemme: A Flexible and Modular Learning Platform for Medical Images

Guoqing Zhang, Jingyun Yang, Yang Li

TL;DR

Flemme addresses the challenge of building versatile, high-performance medical image models across diverse modalities and dataset sizes by decoupling encoders from architectures. It introduces a modular framework leveraging CNN, vision transformer, and state-space model backbones within an encoder–decoder paradigm, augmented by a context-embedded, hierarchical pyramid architecture for vertical feature fusion. The key contributions include a backbone-agnostic design, a hierarchy-based refinement mechanism (pyramid loss), and comprehensive experiments across segmentation, reconstruction, and generation that demonstrate improved accuracy and competitive efficiency. This platform enables rapid, fair comparisons of encoders and supports multi-task medical imaging research, with practical implications for scalable model development and deployment.
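The decoupling described above can be illustrated with a small registry pattern. This is a minimal sketch, not Flemme's actual API: the registries `ENCODERS` and `ARCHITECTURES`, the `register` decorator, `build_model`, and the toy encoders that operate on plain lists are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical registries (assumption: not part of Flemme's real API).
ENCODERS: Dict[str, Callable[[Vector], Vector]] = {}
ARCHITECTURES: Dict[str, Callable[[Callable[[Vector], Vector]], Callable]] = {}


def register(registry: dict, name: str):
    """Return a decorator that stores the decorated function under `name`."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap


@register(ENCODERS, "conv")
def conv_encoder(x: Vector) -> Vector:
    # Toy stand-in for a convolutional backbone: average adjacent values.
    return [(a + b) / 2 for a, b in zip(x, x[1:])]


@register(ENCODERS, "identity")
def identity_encoder(x: Vector) -> Vector:
    # Toy stand-in for any other backbone: pass features through unchanged.
    return list(x)


@register(ARCHITECTURES, "seg")
def seg_architecture(encoder):
    # Segmentation-style head: threshold the encoded features into a mask.
    def model(x: Vector) -> List[int]:
        return [1 if f > 0.5 else 0 for f in encoder(x)]
    return model


@register(ARCHITECTURES, "ae")
def ae_architecture(encoder):
    # Reconstruction-style head: return the encoded features directly.
    def model(x: Vector) -> Vector:
        return encoder(x)
    return model


def build_model(encoder_name: str, arch_name: str):
    """Combine any registered encoder with any registered architecture."""
    return ARCHITECTURES[arch_name](ENCODERS[encoder_name])


# Every encoder/architecture pairing is available without extra glue code.
combos = [(e, a) for e in ENCODERS for a in ARCHITECTURES]
```

Because an architecture only receives an encoder as an argument, adding a new backbone in this sketch means registering one function, after which it can be paired with every existing architecture.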

Abstract

With the rapid development of computer vision and the emergence of powerful network backbones and architectures, the application of deep learning in medical imaging has become increasingly significant. Unlike natural images, medical images lack huge volumes of data but feature more modalities, making it difficult to train a general model that performs satisfactorily across various datasets. In practice, practitioners often have to manually create and test models that combine independent backbones and architectures, which is a laborious and time-consuming process. We propose Flemme, a FLExible and Modular learning platform for MEdical images. Our platform separates encoders from the model architectures so that different models can be constructed via various combinations of supported encoders and architectures. We construct encoders using building blocks based on convolution, transformer, and state-space model (SSM) to process both 2D and 3D image patches. A base architecture is implemented following an encoder-decoder style, with several derived architectures for image segmentation, reconstruction, and generation tasks. In addition, we propose a general hierarchical architecture incorporating a pyramid loss to optimize and fuse vertical features. Experiments demonstrate that this simple design leads to an average improvement of 5.60% in Dice score and 7.81% in mean intersection over union (mIoU) for segmentation models, as well as an enhancement of 5.57% in peak signal-to-noise ratio (PSNR) and 8.22% in structural similarity (SSIM) for reconstruction models. We further utilize Flemme as an analytical tool to assess the effectiveness and efficiency of various encoders across different tasks. Code is available at https://github.com/wlsdzyzl/flemme.
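For readers unfamiliar with the metrics named in the abstract, the following sketch computes Dice, intersection over union (IoU), and PSNR from their standard textbook definitions on flattened inputs. The function names and the convention of returning 1.0 for two empty masks are assumptions made for illustration; how the paper averages these metrics across datasets is not stated here.

```python
import math
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as 0/1 values."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def iou_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """IoU = |A ∩ B| / |A ∪ B| for binary masks given as 0/1 values."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return 1.0 if union == 0 else inter / union


def psnr(pred: Sequence[float], target: Sequence[float],
         max_value: float = 1.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE), in decibels."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: a 4-pixel mask with one true positive, one false positive,
# and one false negative.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
example_dice = dice_score(pred_mask, true_mask)
example_iou = iou_score(pred_mask, true_mask)
```

mIoU is the mean of the per-class (or per-sample) IoU values; Dice and IoU are monotonically related for a single mask pair, which is why the two usually move together in segmentation results.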

Paper Structure

This paper contains 23 sections, 14 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: A semantic overview of Flemme. The left box gives 3 examples of building blocks based on convolution, transformer, and SSM. Encoders and architectures are shown in the middle and right blocks. Different models can be constructed efficiently from various combinations of encoders and architectures labeled with the corresponding colors.
  • Figure 2: Pipelines of encoder and decoder. Components enclosed in dotted boxes indicate optional elements.
  • Figure 3: Illustration of supported architectures: (a) SeM, (b) AE, (d) DDPM. The dashed lines indicate optional paths.
  • Figure 4: A segmentation model constructed with Hierarchical SeM (H-SeM) and a U-shaped encoder using ConvBlock.
  • Figure 5: Qualitative results of segmentation models. The top four rows show segmentation results for 2D image datasets: CVC-ClinicDB, Echonet, ISIC, and TN3K. The bottom two rows show segmentation results of the middle slices for 3D image datasets: BraTS21 and ImageCAS. The pixels highlighted in red represent incorrect predictions.
  • ...and 1 more figure