Multimodal Lego: Model Merging and Fine-Tuning Across Topologies and Modalities in Biomedicine

Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik

TL;DR

The paper addresses three challenges of multimodal learning in biomedicine (diverse data distributions, missing modalities, and the need for scalable, topology-agnostic fusion) by introducing MM-Lego, a modular framework built from LegoBlocks that wrap unimodal encoders and operate in the frequency domain. It presents two approaches: LegoMerge, which merges unimodal models without any multimodal training, and LegoFuse, which adds limited fine-tuning to improve performance further. Across seven biomedical datasets, LegoMerge is competitive with end-to-end fusion models without any fine-tuning, while LegoFuse surpasses all benchmarks on five of the seven, demonstrating robust handling of modality imbalance and missing data. MM-Lego thus offers a scalable, modular, and data-efficient way to integrate heterogeneous modalities beyond vision and language.

Abstract

Learning holistic computational representations in physical, chemical or biological systems requires the ability to process information from different distributions and modalities within the same model. Thus, the demand for multimodal machine learning models has sharply risen for modalities that go beyond vision and language, such as sequences, graphs, time series, or tabular data. While there are many available multimodal fusion and alignment approaches, most of them require end-to-end training, scale quadratically with the number of modalities, cannot handle cases of high modality imbalance in the training set, or are highly topology-specific, making them too restrictive for many biomedical learning tasks. This paper presents Multimodal Lego (MM-Lego), a general-purpose fusion framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We achieve this by introducing a wrapper for any unimodal encoder that enforces shape consistency between modality representations. It harmonises these representations by learning features in the frequency domain to enable model merging with little signal interference. We show that MM-Lego 1) can be used as a model merging method which achieves competitive performance with end-to-end fusion models without any fine-tuning, 2) can operate on any unimodal encoder, and 3) is a model fusion method that, with minimal fine-tuning, surpasses all benchmarks in five out of seven datasets.

Paper Structure (14 sections, 4 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The Multimodal Lego workflow to turn a set of encoders into a performant multimodal model. LegoBlock (1) makes unimodal encoders compatible with model merging techniques by learning a latent representation in the frequency domain to prevent signal interference effects upon aggregation. Any set of LegoBlocks can be merged into a multimodal model without any fine-tuning (LegoMerge (2a)) or with minimal fine-tuning to achieve state-of-the-art performance (LegoFuse (2b)).
  • Figure 2: Frequency-domain state passing in LegoBlock. The latent bottleneck $L_0$ is randomly initialized as a model parameter at the start of training and iteratively updated by each pass through the LegoBlocks. The real components of the FFT, $(L_0^{\mathcal{F}})^r$ and $\mathcal{F}(h^{(A)})^r$, are used in the cross-attention update, and the imaginary component $(L_0^{\mathcal{F}})^i$ is used for reconstruction.
  • Figure 3: Mean task performance (concordance index/AUC) of LegoBlock (Tabular), LegoBlock (Image/Time Series) and LegoMerge, showing the increase in task performance from applying a multimodal model merge without any fine-tuning. The proposed multimodal model merge improves performance on 6 out of 7 datasets.
  • Figure 4: AUC performance on the MIMIC dataset when merging existing encoders (SNN for tabular, AMIL for time series) using LegoMerge and LegoFuse. The multimodal model merge performs much better than an ensemble, showing that merging yields performance gains at no additional cost even before the fine-tuning in LegoFuse.
  • Figure 5: Example of signal interference on a random normal latent variable and its additive inverse with some added noise, a severe case in which nearly all signal cancels out. The Fourier-transformed data does not suffer from this problem when the harmonic mean is applied, which is a key reason for the choice of model merging architecture.
  • ...and 2 more figures
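
The interference case described in the Figure 5 caption can be illustrated with a short, standard-library-only sketch. It is not the authors' implementation. The merge rule used here (harmonic mean of the per-frequency magnitudes, arithmetic mean of the phases) is an assumption about what "applying the harmonic mean" to Fourier-transformed data means, and the helper names (`dft`, `merge_spectra`, `spectrum_energy`), the signal length and the noise level are all hypothetical choices made for the demonstration.

```python
import cmath
import math
import random


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (fine for a small demo)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def harmonic_mean(a, b, eps=1e-12):
    """Harmonic mean of two non-negative numbers (eps guards against 0/0)."""
    return 2.0 * a * b / (a + b + eps)


def merge_spectra(spec_a, spec_b):
    """ASSUMED merge rule: harmonic mean of magnitudes, mean of phase angles."""
    merged = []
    for za, zb in zip(spec_a, spec_b):
        magnitude = harmonic_mean(abs(za), abs(zb))
        phase = 0.5 * (cmath.phase(za) + cmath.phase(zb))
        merged.append(cmath.rect(magnitude, phase))
    return merged


def signal_energy(signal):
    """Mean-square energy of a real-valued signal."""
    return sum(v * v for v in signal) / len(signal)


def spectrum_energy(spectrum):
    """Mean-square energy computed from a spectrum via Parseval's theorem."""
    n = len(spectrum)
    return sum(abs(z) ** 2 for z in spectrum) / (n * n)


random.seed(0)
n = 64  # hypothetical latent size
x = [random.gauss(0.0, 1.0) for _ in range(n)]   # random normal latent
y = [-v + random.gauss(0.0, 0.05) for v in x]    # additive inverse plus noise

# Averaging in the signal domain: x and y cancel, leaving only half the noise.
naive = [(a + b) / 2.0 for a, b in zip(x, y)]

# Merging in the frequency domain: magnitudes are close, so they survive.
merged = merge_spectra(dft(x), dft(y))

energy_x = signal_energy(x)
energy_naive = signal_energy(naive)
energy_merged = spectrum_energy(merged)

print(f"energy of x:                {energy_x:.4f}")
print(f"energy after naive average: {energy_naive:.6f}")
print(f"energy after spectral merge: {energy_merged:.4f}")
```

Under these assumptions the signal-domain average retains only the noise term, while the spectral merge keeps energy of the same order as the original latent, because the magnitudes of the two spectra are nearly equal and the sign flip only affects the phase. The sketch compares energies in the frequency domain rather than inverting the merged spectrum, since the merged spectrum is not guaranteed to correspond to a real-valued signal.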