OneProt: Towards Multi-Modal Protein Foundation Models

Klemens Flöge, Srisruthi Udayakumar, Johanna Sommer, Marie Piraud, Stefan Kesselheim, Vincent Fortuin, Stephan Günnemann, Karel J van der Weg, Holger Gohlke, Erinc Merdivan, Alina Bazarova

TL;DR

OneProt extends ImageBind-style multi-modal learning to proteins, aligning sequence with structure, binding pockets, and text descriptions to enable cross-modal retrieval and a range of downstream predictions. It combines pre-trained encoders (ESM2 for sequence, ProNet for structure, a pocket encoder, and BiomedBERT for text) with trainable projection heads, trained with a symmetric InfoNCE objective; because every training pair contains a sequence, alignment between the remaining modalities emerges without direct supervision. Across enzyme function prediction, binding-site analysis, thermostability, protein–protein interactions, and GO annotation, OneProt performs competitively with or better than larger baselines while training data-efficiently, and ablations reveal modality-specific contributions, notably from the pocket and structure encoders. The framework is modular, can incorporate additional modalities and downstream tasks with moderate compute, and holds promise for drug discovery, biocatalysis planning, and protein design.

Abstract

Recent advances in Artificial Intelligence have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, text, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of protein modality encoders in a lightweight fine-tuning scheme that focuses on pairwise alignment with sequence data rather than requiring full matches. This novel approach comprises a mix of Graph Neural Networks and transformer architectures. It demonstrates strong performance in retrieval tasks and showcases the efficacy of multi-modal systems in Protein Machine Learning through a broad spectrum of downstream baselines, including enzyme function prediction and binding site analysis. Furthermore, OneProt enables the transfer of representational information from specialized encoders to the sequence encoder, enhancing capabilities for distinguishing evolutionarily related and unrelated sequences and exhibiting representational properties where evolutionarily related proteins align in similar directions within the latent space. In addition, we extensively investigate modality ablations to identify the encoders that contribute most to predictive performance, highlighting the significance of the binding site encoder, which has not been used in similar models previously. This work expands the horizons of multi-modal protein models, paving the way for transformative applications in drug discovery, biocatalytic reaction planning, and protein engineering.
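
The pairwise, sequence-centered alignment described above is trained with a symmetric InfoNCE objective, according to the TL;DR. The following is a minimal sketch of such a loss for one batch of (sequence, other-modality) embedding pairs, not the authors' implementation. The function names, the temperature value of 0.07, and the choice to average (rather than sum) the two directions are illustrative assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row_i)[i]: the i-th row should pick the i-th column."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_partition = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def symmetric_info_nce(seq_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    seq_emb[i] and other_emb[i] are assumed to describe the same protein;
    every other item in the batch serves as a negative.
    The temperature default is a hypothetical value chosen for illustration.
    """
    seq = [_normalize(v) for v in seq_emb]
    other = [_normalize(v) for v in other_emb]
    # logits[i][j] = cosine(seq_i, other_j) / temperature
    logits = [
        [sum(a * b for a, b in zip(s, o)) / temperature for o in other]
        for s in seq
    ]
    logits_t = [list(col) for col in zip(*logits)]  # other -> sequence direction
    # Averaging the two directions is an assumption; summing differs only by a factor of 2.
    return 0.5 * (
        _cross_entropy_on_diagonal(logits) + _cross_entropy_on_diagonal(logits_t)
    )


if __name__ == "__main__":
    seq_batch = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    mismatched = matched[::-1]
    print("matched pairs:   ", symmetric_info_nce(seq_batch, matched))
    print("mismatched pairs:", symmetric_info_nce(seq_batch, mismatched))
```

Because each pair in the batch contains a sequence embedding, a loss of this shape only ever pulls another modality toward the sequence space; any alignment between two non-sequence modalities is a side effect of both being aligned to sequence.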

Paper Structure

This paper contains 10 sections, 15 equations, 12 figures, 17 tables.

Figures (12)

  • Figure 1: Overview of OneProt’s alignment of protein sequences with other modalities for comprehensive cross-modal integration. Training is performed using pairs comprising a sequence and another modality, leading to the emergent alignment between all other modalities, as indicated by the dashed lines.
  • Figure 2: Overview of the OneProt model. The model aligns multiple modalities, including primary protein sequence, 3D protein structure, binding pockets, and text annotations. Each modality is processed by its respective encoder, generating embeddings aligned in a shared latent space, facilitating cross-modal learning and integration.
  • Figure 3: Alignment performance for modality pairs that were paired during training (left column) and pairs that were not (emergent, right column), for OneProt-5 (top row) and OneProt-4 (bottom row). The axes of the polygons correspond to the modality pairs, and the vertices correspond to R@1 (inner polygon), R@10 (middle polygon), and R@100 (outer polygon): the fraction of queries for which the correct (ground-truth) match appears among the top 1, top 10, or top 100 retrieved embeddings, respectively, with a best possible value of 1. MR is the median rank of the ground-truth embedding in the other modality, also with a best possible value of 1.
  • Figure 4: Model performance comparison based on Area Under the Precision-Recall curve (AUPR) scores for TopEnzyme. Each boxplot shows the AUPR distribution for one method (TopEC, CLEAN, ESM-2, ProTrek-35M, ProTrek-650M, OneProt).
  • Figure 5: Cosine similarity distributions for ESM-2, ProTrek-35M, ProTrek-650M, OneProt-4, and OneProt-5. The plot shows the similarity of a given protein to three groups: the 50 most evolutionarily similar proteins, the 50 most evolutionarily divergent sequences, and 1000 unrelated sequences. While all models partially capture evolutionary relationships, OneProt distinctly separates the three classes, demonstrating its ability to generate meaningful sequence representations.
  • ...and 7 more figures
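
The retrieval metrics named in the Figure 3 caption (R@k and median rank) can be computed as in the following minimal sketch. It assumes that `queries[i]` and `candidates[i]` are embeddings of the same protein in two modalities and that retrieval ranks candidates by cosine similarity; the similarity measure, the tie handling, and all function names are illustrative assumptions rather than details taken from the paper.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ground_truth_ranks(queries, candidates):
    """1-based rank of candidates[i] when all candidates are sorted by similarity to queries[i]."""
    ranks = []
    for i, query in enumerate(queries):
        sims = [cosine(query, cand) for cand in candidates]
        # Rank = 1 + number of candidates strictly more similar than the true match
        # (ties are resolved in favor of the true match; this is an assumption).
        ranks.append(1 + sum(s > sims[i] for s in sims))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth match is within the top k (best value: 1)."""
    return sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks):
    """Median rank of the ground-truth match (best value: 1)."""
    return median(ranks)


if __name__ == "__main__":
    # Toy embeddings, made up for illustration only.
    queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    candidates = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]
    ranks = ground_truth_ranks(queries, candidates)
    print("ranks:", ranks)
    print("R@1:", recall_at_k(ranks, 1))
    print("median rank:", median_rank(ranks))
```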