MAFM^3: Modular Adaptation of Foundation Models for Multi-Modal Medical AI
Mohammad Areeb Qazi, Munachiso S Nwadike, Ibrahim Almakky, Mohammad Yaqub, Numan Saeed
TL;DR
MAFM^3 addresses data scarcity and modality variability in medical imaging by proposing a modular framework that extends a frozen foundation model with lightweight, selectively activatable components for classification, prognosis, and segmentation across CT, PET, and reports. The method combines within-model LoRA adapters, post-model decoders, and resolution-aware embeddings to enable cumulative, forgetting-free growth across tasks and modalities. Empirical results on the HECKTOR dataset show consistent gains in prognosis (C-index up to 0.721) and segmentation (Dice up to 65.7%) with modest parameter overhead, and improved robustness to domain shifts. The work suggests a practical path toward scalable, generalist medical AI that can evolve with clinical needs while minimizing retraining.
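For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a Dice score and a concordance index (C-index) are commonly computed. It is not taken from the paper's code: the function names `dice_score` and `concordance_index` are illustrative, masks are assumed to be flat binary sequences, and pairs with tied survival times are simply skipped, which is a simplification of Harrell's C-index.

```python
from itertools import combinations


def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ranked correctly by predicted risk.

    A pair is comparable when the patient with the shorter follow-up time
    actually had the event (events[i] is truthy). Tied times are skipped.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        # 'a' is the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # censored before the other patient's time: not comparable
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.721 sits between the two; a Dice score of 1.0 means the predicted and reference masks overlap exactly.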
Abstract
Foundation models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM^3 (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM^3 provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model, initially trained for classification, with prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM^3 improved the Dice score by 5% compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work can be found at https://github.com/Areeb2735/CTscan_prognosis_VLM
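The idea of a frozen backbone with components that are switched on per task can be illustrated with a toy example. The sketch below uses low-rank adapters (the LoRA adapters named in the TL;DR) on a single linear layer, written in plain Python so it runs without dependencies. It is an illustration under stated assumptions, not the authors' implementation: the class names `LoRAAdapter` and `ModularLinear`, the adapter names, and the rank and scaling values are all hypothetical.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class LoRAAdapter:
    """Low-rank update delta(x) = (alpha / rank) * B(Ax).

    B starts at zero, so a freshly added adapter leaves the output unchanged.
    """

    def __init__(self, d_in, d_out, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class ModularLinear:
    """A frozen base layer plus named adapters, at most one active per call."""

    def __init__(self, W):
        self.W = W  # frozen backbone weights; never modified here
        self.adapters = {}

    def add_adapter(self, name, adapter):
        self.adapters[name] = adapter

    def forward(self, x, active=None):
        y = matvec(self.W, x)
        if active is not None:
            # Only the selected adapter contributes; all others stay dormant.
            y = [a + b for a, b in zip(y, self.adapters[active].delta(x))]
        return y


layer = ModularLinear([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
layer.add_adapter("prognosis", LoRAAdapter(3, 2, seed=1))
layer.add_adapter("segmentation", LoRAAdapter(3, 2, seed=2))
x = [1.0, 2.0, 3.0]
base = layer.forward(x)                   # original capability, no adapter
prog = layer.forward(x, "prognosis")      # same backbone, prognosis adapter on
```

Because the backbone weights are never updated and each adapter only adds its own term, training or adding one adapter cannot change the outputs obtained with another adapter or with none, which is the property that makes this style of growth free of forgetting.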
