PEMMA: Parameter-Efficient Multi-Modal Adaptation for Medical Image Segmentation

Nada Saadi, Numan Saeed, Mohammad Yaqub, Karthik Nandakumar

TL;DR

The paper addresses the challenge of leveraging PET information for tumor segmentation when PET data may be unavailable during training or inference. It introduces PEMMA, a parameter-efficient multimodal adaptation framework that freezes a CT-trained transformer model and adds a PET pathway implemented with visual prompts, low-rank adaptation (LoRA) to attention, and a parallel skip connection to minimize cross-modal entanglement. PEMMA achieves comparable accuracy to early fusion while drastically reducing trainable parameters, and it delivers notable improvements in PET-related Dice scores on unseen datasets, illustrating strong performance with single-modality fine-tuning and robustness to modality availability. The work demonstrates practical benefits for continual learning and proposes extension to other imaging modalities like MRI, highlighting its potential impact on flexible, efficient multimodal medical image segmentation.

Abstract

Imaging modalities such as Computed Tomography (CT) and Positron Emission Tomography (PET) are key in cancer detection, inspiring Deep Neural Network (DNN) models that merge these scans for tumor segmentation. When both CT and PET scans are available, it is common to combine them as two channels of the input to the segmentation model. However, this method requires both scan types during training and inference, which is a challenge given the limited availability of PET scans and sometimes restricts the process to CT scans only. Hence, there is a need for a flexible DNN architecture that can be trained or updated using only CT scans but can effectively utilize PET scans when they become available. In this work, we propose a parameter-efficient multi-modal adaptation (PEMMA) framework for lightweight upgrading of a transformer-based segmentation model trained only on CT scans to also incorporate PET scans. The benefits of the proposed approach are two-fold. Firstly, we leverage the inherent modularity of the transformer architecture and perform low-rank adaptation (LoRA) of the attention weights to achieve parameter-efficient adaptation. Secondly, since the PEMMA framework attempts to minimize cross-modal entanglement, it is possible to subsequently update the combined model using only one modality without causing catastrophic forgetting of the other modality. Our proposed method achieves performance comparable to early fusion techniques with just 8% of the trainable parameters, along with a remarkable +28% improvement in the average Dice score on PET scans when trained on a single modality.
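To make the low-rank adaptation idea in the abstract concrete, the following is a minimal NumPy sketch of a LoRA-style linear layer. It is an illustration under stated assumptions and not the authors' implementation: the class name `LoRALinear`, the hidden size of 768, the rank `r=4`, and the scaling factor `alpha` are all hypothetical choices made for this example. The sketch shows two properties that parameter-efficient adaptation relies on: the adapted layer initially reproduces the frozen layer's output, and the number of trainable parameters is small relative to the frozen weight.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a low-rank update (alpha / r) * B @ A.

    Illustrative only: names, sizes, and initialization are assumptions,
    not details taken from the paper.
    """

    def __init__(self, d_in, d_out, r=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        # Pretrained weight; kept frozen during adaptation.
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))
        # Low-rank factors; these would be the only trainable parameters.
        self.A = rng.normal(0.0, 0.02, size=(r, d_in))
        self.B = np.zeros((d_out, r))  # zero init, so the update starts at 0
        self.scale = alpha / r

    def __call__(self, x):
        # Frozen path plus scaled low-rank path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

    def frozen_params(self):
        return self.W.size


# Hypothetical sizes: a 768-dimensional projection with rank-4 adaptation.
layer = LoRALinear(d_in=768, d_out=768, r=4)
x = np.random.default_rng(1).normal(size=(2, 768))
frozen_out = x @ layer.W.T

# With B initialized to zero, the adapted layer matches the frozen layer.
same_at_init = np.allclose(layer(x), frozen_out)
ratio = layer.trainable_params() / layer.frozen_params()
```

In this toy setting the low-rank factors hold 2 x 4 x 768 values against 768 x 768 frozen weights. The ratio here depends entirely on the assumed sizes and should not be read as the 8% figure reported in the abstract, which refers to the full PEMMA model.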

Paper Structure (8 sections, 2 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Overview of our proposed architecture, PEMMA. At the input level, we separate the paths for CT and PET by adding the PET skip connection $\theta_{\textrm{SK}}^{P}$. We freeze both the encoder and the decoder of the base UNETR model and introduce LoRA after each of the 12 ViT blocks as the only trainable layers.