PLM-eXplain: Divide and Conquer the Protein Embedding Space

Jan van Eck, Dea Gogishvili, Wilson Silva, Sanne Abeln

TL;DR

The proposed PLM-eXplain enables biological interpretation of model decisions without sacrificing accuracy, offering a generalizable solution for enhancing PLM interpretability across various downstream applications.

Abstract

Protein language models (PLMs) have revolutionised computational biology through their ability to generate powerful sequence representations for diverse prediction tasks. However, their black-box nature limits biological interpretation and translation to actionable insights. We present an explainable adapter layer, PLM-eXplain (PLM-X), that bridges this gap by factoring PLM embeddings into two components: an interpretable subspace based on established biochemical features, and a residual subspace that preserves the model's predictive power. Using embeddings from ESM2, our adapter incorporates well-established properties, including secondary structure and hydropathy, while maintaining high performance. We demonstrate the effectiveness of our approach across three protein-level classification tasks: prediction of extracellular vesicle association, identification of transmembrane helices, and prediction of aggregation propensity. PLM-X enables biological interpretation of model decisions without sacrificing accuracy, offering a generalisable solution for enhancing PLM interpretability across various downstream applications. This work addresses a critical need in computational biology by providing a bridge between powerful deep learning models and actionable biological insights.
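To make the idea of factoring an embedding into an interpretable and a residual subspace concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the linear encoder and decoder, the toy dimensions, the mean-squared-error losses, and all names (`partition`, `reconstruct`, `adapter_loss`, `W_enc`, `W_dec`) are assumptions made for illustration, and the adversarial component mentioned in the paper is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16     # toy embedding width (assumption; chosen small for readability)
N_FEATURES = 4   # toy number of interpretable dimensions (assumption)

# Hypothetical, untrained linear encoder and decoder weights.
W_enc = rng.normal(scale=0.1, size=(EMB_DIM, EMB_DIM))
W_dec = rng.normal(scale=0.1, size=(EMB_DIM, EMB_DIM))


def partition(embedding):
    """Encode an embedding and split it into (interpretable, residual) parts."""
    z = embedding @ W_enc
    return z[..., :N_FEATURES], z[..., N_FEATURES:]


def reconstruct(interpretable, residual):
    """Decode the concatenated subspaces back to the original embedding space."""
    return np.concatenate([interpretable, residual], axis=-1) @ W_dec


def adapter_loss(embedding, features):
    """Sum of two illustrative MSE terms (an assumed objective, not the paper's):
    the interpretable part should match known features, and the decoder
    should recover the original embedding from both parts together."""
    interpretable, residual = partition(embedding)
    feature_loss = np.mean((interpretable - features) ** 2)
    recon_loss = np.mean((reconstruct(interpretable, residual) - embedding) ** 2)
    return feature_loss + recon_loss


# Toy batch: 8 residues with random embeddings and random feature targets.
embeddings = rng.normal(size=(8, EMB_DIM))
features = rng.normal(size=(8, N_FEATURES))
interp, resid = partition(embeddings)
loss = adapter_loss(embeddings, features)
```

In this sketch a downstream model would consume the concatenation of `interp` and `resid`, so feature attributions that fall on the first `N_FEATURES` dimensions can be read as named biochemical properties.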

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Our encoder-decoder model architecture splits protein language model embeddings into two complementary subspaces: one capturing explicit physicochemical features and another containing the residual predictive information. An adversarial component stimulates this separation while preserving the original embeddings' prediction capabilities. We evaluate the partitioned embeddings using XGBoost and CNN models on three downstream protein-level tasks.
  • Figure 2: Global feature importance for predicting protein aggregation, comparing three different embedding approaches: the original (ESM2) embeddings, partitioned embeddings (PLM-X), and crafted-only embeddings. For the original model, top features are unknown. For the adapted model, several known features can be identified, such as GRAVY, secondary structure components (extended $\beta$-strands, SS3 E and SS8 E), and proline (P). Figure S2 shows detailed SHAP plots for all three downstream prediction tasks.
  • Figure 3: Local interpretation for the transmembrane helix predictions for the leptin protein. (A) The full-length structure of leptin (AF-P41159-F1); highlighted regions are colour-coded based on the highest activation by Kernel 4. (B) The most informative features determined by the sum of absolute SHAP values. Each dot represents a feature at a specific position within a motif. (C) Summed SHAP values for a filter, showcasing the top 5 features at each position. This plot highlights the positional importance of features along the activation values.
  • Figure S1: Data curation pipeline for the model adaptation. The human proteome from AlphaFoldDB (Jumper et al., 2021; Varadi et al., 2024) was annotated with secondary structure components and other sequence-based features. The resulting 34 features were used to create the knowledge-informed subspace.
  • Figure S2: SHAP summary plots for global interpretability for three different downstream prediction tasks: (A) aggregation propensity prediction, (B) association with extracellular vesicles (EV) and (C) transmembrane helix predictions. For each prediction task, feature importances are shown for the original (ESM2), partitioned, and crafted-only (baseline) embeddings. Each plot summarises how the top features in a dataset impact the model’s output. Each instance of the explanation is represented by a single dot on each feature row. Colour is used to display the original value of a feature. For the original model, top features are unknown. For the partitioned model, several known features, such as secondary structure features (SS8, SS3), GRAVY, and accessible surface area (ASA), are displayed. For the baseline model, which uses only the knowledge-informed subspace of embeddings, all the features are explainable.
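
GRAVY, which appears among the known features in the captions above, is the grand average of hydropathy: the mean Kyte-Doolittle hydropathy value over all residues of a sequence. The sketch below shows one way such a crafted feature could be computed. The function name `gravy` and the whole-sequence averaging are assumptions for illustration; this section does not state how the authors computed the feature or over which window.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence.

    Hypothetical helper: raises ValueError on an empty sequence and
    KeyError on any residue outside the 20 standard amino acids.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

Positive values indicate a more hydrophobic sequence and negative values a more hydrophilic one.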