MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language

Yoel Shoshan, Moshiko Raboh, Michal Ozery-Flato, Vadim Ratner, Alex Golts, Jeffrey K. Weber, Ella Barkan, Simona Rabinovici-Cohen, Sagi Polaczek, Ido Amos, Ben Shapira, Liam Hazan, Matan Ninio, Sivan Ravid, Michael M. Danziger, Yosi Shamay, Sharon Kurant, Joseph A. Morrone, Parthasarathy Suryanarayanan, Michal Rosen-Zvi, Efrat Hexter

TL;DR

MAMMAL (Molecular Aligned Multi-Modal Architecture and Language) is a cross-domain foundation model that unifies proteins, small molecules, and transcriptomic data within a single encoder–decoder Transformer framework. It employs a structured, multi-domain prompt syntax and continuous scalar embeddings to support classification, regression, and generation tasks across the drug discovery pipeline, and is pretrained on roughly 2 billion samples from six public datasets. Across 11 downstream benchmarks, MAMMAL achieves state-of-the-art results on 9 tasks and remains competitive on the remaining two, demonstrating strong cross-domain transfer and task versatility. Comparative analyses with AlphaFold 3 on antibody–antigen and nanobody–antigen binding show that MAMMAL classifies binders better on 3 of 4 targets, underscoring the value of integrated, sequence-based cross-domain learning for predictive design in biomedicine. The code and pretrained weights are openly released to facilitate replication and further development in cross-domain biomedical AI.

Abstract

Large language models applied to vast biological datasets have the potential to transform biology by uncovering disease mechanisms and accelerating drug development. However, current models are often siloed, trained separately on small molecules, proteins, or transcriptomic data, which limits their ability to capture complex, multi-modal interactions. Effective drug discovery requires computational tools that integrate multiple biological entities while supporting both prediction and generation, a challenge existing models struggle to address. For this purpose, we present MAMMAL (Molecular Aligned Multi-Modal Architecture and Language), a versatile method used to create a multi-task foundation model that learns from large-scale biological datasets across diverse modalities, including proteins, small molecules, and omics. MAMMAL's structured prompt syntax supports classification, regression, and generation tasks while handling both token and scalar inputs and outputs. Evaluated on eleven diverse downstream tasks, it reaches a new state of the art (SOTA) in nine tasks and is comparable to SOTA in two tasks, all within a unified architecture, unlike prior task-specific models. Additionally, we explored the binding prediction capabilities of AlphaFold 3 on antibody-antigen and nanobody-antigen complexes, and found that MAMMAL shows significantly better classification performance on 3 out of 4 targets. The model code and pretrained weights are publicly available at https://github.com/BiomedSciAI/biomed-multi-alignment and https://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (A) We introduce a multi-align model pretrained on six datasets, each containing tens to hundreds of millions of data points. These data points include protein sequences, small molecules, and gene expression profiles, with a combined sample size of 2 billion. (B) The multi-align model combines flexible encoder-only and encoder-decoder components. It takes sequences as input, which may contain any combination of tokens and scalar elements, processed by an encoder stack consisting of self-attention blocks. In encoder-only mode, a dedicated token prediction head outputs logits for token predictions, with an optional scalar prediction head for scalar outputs. In encoder-decoder mode, residual connections inject features from the encoder’s final hidden layer into each decoder layer, and a decoder-specific prediction head outputs the final logits. (C) Diverse downstream tasks performed by the multi-align model, mapped to their contributions within the steps of a typical drug discovery pipeline. (D) Diverse downstream tasks performed by the multi-align model, categorized by data type used in the fine-tuning process. (E) Performance of the multi-align model across a diverse set of tasks compared to SOTA.
  • Figure 2: Comparison of antibody/nanobody--antigen binding prediction between MAMMAL and AlphaFold 3 (AF3). (a) HER2 extracellular domain (ECD) binder/non-binder discrimination scores for MAMMAL versus AF3. (b-c) Binding prediction performance for nanobodies (VHHs) against three antigens: CD206, VWF, and TBG (green = binders, red = non-binders; mean ± SD; unpaired two-sided Student's t-test; *$P < 0.05$, ****$P < 0.0001$). (b) MAMMAL evaluation. (c) AF3 ipTM (Interface Predicted TM-score) evaluation. (d) HER2 ECD structure with representative binder/non-binder predictions. Therapeutic antibodies Trastuzumab (blue) and Pertuzumab (purple) shown for comparison. AF3 predicts both binders and non-binders interacting with the same domain, distinct from known therapeutic epitopes. (e) TBG structure with binding/non-binding VHHs. AF3 predicts distinct poses for binders vs non-binders.
  • Figure 3: A prompt, consisting of both token IDs and scalars, is processed and enters the encoder. Both the encoder and the decoder output logits, which are used for the classification loss. Additionally, the output of the encoder is sent to a (learned) scalar prediction head, which allows the prediction of scalars for any subset of the tokens and is used in the regression loss. In this illustration, a single scalar input ("12.7") is used, and a single scalar output ("97.2") is predicted by the model. However, the method fully supports an arbitrary number of input and output scalars.
  • Figure 4: Entity hierarchy for the task of binding prediction of two proteins, and organism prediction of the first one.
  • Figure 5: Entity hierarchy for the task of binding prediction of a TCR and an epitope. "Molecule System 1" represents the TCR complex, "Molecule System 2" represents the antigen, and the entire prompt represents their interaction.
  • ...and 1 more figure
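
The Figure 3 caption describes a prompt that mixes token IDs with scalars, and a training objective that combines a classification loss on logits with a regression loss on predicted scalars. The following is a minimal, standard-library-only sketch of that idea, not the authors' implementation. The toy vocabulary, the `<scalar>` placeholder token, the linear scalar projection, the unweighted sum of the two losses, and all function names are assumptions made for illustration; the captions above do not specify these details.

```python
import math
import random

EMBED_DIM = 4
# Hypothetical toy vocabulary; "<scalar>" marks positions that carry a number.
VOCAB = ["<pad>", "<scalar>", "A", "C", "G"]

random.seed(0)
# Stand-ins for learned parameters (randomly initialised here).
token_embedding = {
    tok: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for tok in VOCAB
}
scalar_projection = [random.uniform(-1, 1) for _ in range(EMBED_DIM)]


def embed_prompt(prompt):
    """Embed a mixed prompt of tokens (str) and scalars (float).

    Assumption: a scalar is embedded as the "<scalar>" token embedding
    plus the scalar value times a learned projection vector.
    """
    embedded = []
    for item in prompt:
        if isinstance(item, float):
            base = token_embedding["<scalar>"]
            embedded.append([b + item * w for b, w in zip(base, scalar_projection)])
        else:
            embedded.append(list(token_embedding[item]))
    return embedded


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def combined_loss(token_logits, token_targets, scalar_preds, scalar_targets,
                  regression_weight=1.0):
    """Mean cross-entropy over token positions plus weighted MSE over scalars.

    Positions whose target is None are skipped, so each loss only covers
    the subset of positions that actually has a label.
    """
    ce_terms = [cross_entropy(logits, target)
                for logits, target in zip(token_logits, token_targets)
                if target is not None]
    mse_terms = [(pred - target) ** 2
                 for pred, target in zip(scalar_preds, scalar_targets)
                 if target is not None]
    ce = sum(ce_terms) / len(ce_terms) if ce_terms else 0.0
    mse = sum(mse_terms) / len(mse_terms) if mse_terms else 0.0
    return ce + regression_weight * mse


if __name__ == "__main__":
    # One scalar input (12.7) embedded alongside two ordinary tokens.
    vectors = embed_prompt(["A", 12.7, "C"])
    print(len(vectors), "positions, each of dimension", len(vectors[0]))

    # One labelled token position and one labelled scalar position (97.2).
    loss = combined_loss(
        token_logits=[[0.0] * len(VOCAB)],
        token_targets=[VOCAB.index("A")],
        scalar_preds=[97.0],
        scalar_targets=[97.2],
    )
    print("combined loss:", loss)
```

Skipping positions with a `None` target mirrors the caption's statement that scalars can be predicted for any subset of the tokens. A real model would compute the logits and scalar predictions from the encoder and decoder outputs rather than take them as arguments.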