Flemme: A Flexible and Modular Learning Platform for Medical Images
Guoqing Zhang, Jingyun Yang, Yang Li
TL;DR
Flemme addresses the challenge of building versatile, high-performance medical image models across diverse modalities and dataset sizes by decoupling encoders from architectures. It introduces a modular framework leveraging CNN, vision transformer, and state-space model backbones within an encoder–decoder paradigm, augmented by a context-embedded, hierarchical pyramid architecture for vertical feature fusion. The key contributions include a backbone-agnostic design, a hierarchy-based refinement mechanism (pyramid loss), and comprehensive experiments across segmentation, reconstruction, and generation that demonstrate improved accuracy and competitive efficiency. This platform enables rapid, fair comparisons of encoders and supports multi-task medical imaging research, with practical implications for scalable model development and deployment.
Abstract
With the rapid development of computer vision and the emergence of powerful network backbones and architectures, deep learning has become increasingly important in medical imaging. Unlike natural images, medical images come in smaller datasets but span more modalities, making it difficult to train a general model that performs satisfactorily across datasets. In practice, practitioners often have to build and test models by hand, combining independent backbones and architectures, which is laborious and time-consuming. We propose Flemme, a FLExible and Modular learning platform for MEdical images. Our platform separates encoders from model architectures so that different models can be constructed by combining any supported encoder with any supported architecture. We construct encoders from building blocks based on convolution, transformers, and state-space models (SSMs) to process both 2D and 3D image patches. A base architecture is implemented in an encoder-decoder style, with several derived architectures for image segmentation, reconstruction, and generation tasks. In addition, we propose a general hierarchical architecture incorporating a pyramid loss to optimize and fuse vertical features. Experiments demonstrate that this simple design leads to an average improvement of 5.60% in Dice score and 7.81% in mean intersection over union (mIoU) for segmentation models, as well as an improvement of 5.57% in peak signal-to-noise ratio (PSNR) and 8.22% in structural similarity (SSIM) for reconstruction models. We further use Flemme as an analytical tool to assess the effectiveness and efficiency of various encoders across different tasks. Code is available at https://github.com/wlsdzyzl/flemme.
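The separation of encoders from architectures described above can be illustrated with a minimal, dependency-free sketch. This is not Flemme's actual API: every name here (`ENCODERS`, `ARCHITECTURES`, `register_encoder`, `build_model`, the toy encoders and architecture classes) is a hypothetical stand-in, and the "encoders" are simple list operations rather than real convolution, transformer, or SSM blocks. The point is only that an architecture can be written against a generic encoder interface, so any encoder can be paired with any architecture.

```python
from typing import Callable, Dict, List

Vector = List[float]
Encoder = Callable[[Vector], Vector]

# Hypothetical registry of encoders (assumption: not Flemme's real API).
ENCODERS: Dict[str, Encoder] = {}


def register_encoder(name: str) -> Callable[[Encoder], Encoder]:
    """Decorator that stores an encoder under a string key."""
    def wrap(fn: Encoder) -> Encoder:
        ENCODERS[name] = fn
        return fn
    return wrap


@register_encoder("toy_local")
def toy_local(x: Vector) -> Vector:
    # Stand-in for a local operator: average each pair of neighbours.
    return [(a + b) / 2 for a, b in zip(x, x[1:])]


@register_encoder("toy_global")
def toy_global(x: Vector) -> Vector:
    # Stand-in for a global operator: subtract the mean of the input.
    mean = sum(x) / len(x)
    return [v - mean for v in x]


class SegmentationArch:
    """Toy architecture: thresholds encoder features into a binary mask."""

    def __init__(self, encoder: Encoder, threshold: float = 0.0) -> None:
        self.encoder = encoder
        self.threshold = threshold

    def __call__(self, x: Vector) -> List[int]:
        return [1 if f > self.threshold else 0 for f in self.encoder(x)]


class ReconstructionArch:
    """Toy architecture: returns encoder features via an identity decoder."""

    def __init__(self, encoder: Encoder) -> None:
        self.encoder = encoder

    def __call__(self, x: Vector) -> Vector:
        return list(self.encoder(x))


# Hypothetical registry of architectures, keyed by task name.
ARCHITECTURES = {"seg": SegmentationArch, "rec": ReconstructionArch}


def build_model(encoder_name: str, arch_name: str):
    """Pair any registered encoder with any registered architecture."""
    return ARCHITECTURES[arch_name](ENCODERS[encoder_name])


# Every encoder/architecture combination is constructed the same way.
all_models = {
    (enc, arch): build_model(enc, arch)
    for enc in ENCODERS
    for arch in ARCHITECTURES
}
```

Because the architectures only call `self.encoder(x)`, adding a new encoder to the registry makes it available to every task without touching the architecture code, which is what allows encoders to be compared under otherwise identical conditions.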
