A decoupled alignment kernel for peptide membrane permeability predictions

Ali Amirahmadi, Gökçe Geylan, Leonardo De Maria, Farzaneh Etminani, Mattias Ohlsson, Alessandro Tibo

TL;DR

This work tackles the challenge of predicting cyclic peptide permeability with calibrated uncertainty in limited-data settings. It introduces monomer-aware decoupled global alignment kernels (MD-GAK) and a position-aware variant (PMD-GAK) that pair chemically meaningful monomer fingerprints with sequence alignment within Gaussian Processes, ensuring positive semidefiniteness and robust uncertainty estimates. Evaluations on CycPeptMPDB across leakage-aware and scaffold-based splits show that MD-GAK/PMD-GAK improve discrimination and calibration relative to strong baselines, while TAN_sim and convex mixtures reveal complementary strengths between alignment and substructure signals. By bridging mature small-molecule kernel methods with peptide topology, the approach enables data-efficient, uncertainty-aware screening and highlights a path toward richer monomer encoders from chemical language models as the available data grows.

Abstract

Cyclic peptides are promising modalities for targeting intracellular sites; however, cell-membrane permeability remains a key bottleneck, exacerbated by limited public data and the need for well-calibrated uncertainty. Instead of relying on data-hungry, complex deep learning architectures, we propose a monomer-aware decoupled global alignment kernel (MD-GAK), which couples chemically meaningful residue-residue similarity with sequence alignment while decoupling local matches from gap penalties. MD-GAK is a relatively simple kernel. To further demonstrate the robustness of our framework, we also introduce a variant, PMD-GAK, which incorporates a triangular positional prior. As shown in the experimental section, PMD-GAK can offer additional advantages over MD-GAK, particularly in reducing calibration errors. Since our focus is on uncertainty estimation, we use Gaussian Processes as the predictive model, as both MD-GAK and PMD-GAK can be directly applied within this framework. We demonstrate the effectiveness of our methods through an extensive set of experiments, comparing our fully reproducible approach against state-of-the-art models, and show that it outperforms them across all metrics.

Paper Structure

This paper contains 47 sections, 2 theorems, 49 equations, 3 figures, 6 tables.

Key Result

Theorem 1

Let $\mathcal{X}$ be the set of monomers and let $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}_{\ge 0}$ be a positive semidefinite (PSD) local kernel (e.g., Tanimoto on Morgan fingerprints). Fix $\lambda = 1$. For sequences $A=(s_1,\dots,s_n)$ and $B=(t_1,\dots,t_m)$, define the dynamic program and set $K(A,B):=M_{n,m}$. Then $K$ is a PSD kernel on the space of finite monomer sequences.
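
The recursion behind Theorem 1 can be made concrete with a minimal Python sketch. It assumes the MD-GAK recursion has the same shape as the one quoted in the Figure 2 caption, with the local kernel $k(s_i,t_j)$ in place of $\kappa_T(i,j)$ and $\lambda=1$. The boundary conditions ($M_{0,0}=1$, zeros elsewhere on the border), the representation of a fingerprint as a set of on-bits, and the names `tanimoto` and `md_gak` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def md_gak(A, B, local_kernel=tanimoto) -> float:
    """Sketch of a decoupled global alignment kernel with lambda = 1.

    A and B are sequences of monomer fingerprints. The local kernel only
    multiplies the diagonal (match) predecessor; the two gap predecessors
    are added unweighted.
    """
    n, m = len(A), len(B)
    M = np.zeros((n + 1, m + 1))
    M[0, 0] = 1.0  # ASSUMED boundary condition; the rest of the border is 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i, j] = (local_kernel(A[i - 1], B[j - 1]) * M[i - 1, j - 1]
                       + M[i - 1, j] + M[i, j - 1])
    return float(M[n, m])

# Hypothetical toy monomers, each a set of fingerprint bit indices.
a = frozenset({1, 2, 3})
b = frozenset({2, 3, 4})
print(md_gak([a], [b]))        # single monomers: reduces to the local kernel, 0.5
print(md_gak([a, b], [a, b]))  # 3.0
```

Under these assumptions a pair of single-monomer sequences reduces to the local kernel value, and a small Gram matrix can be checked numerically for symmetry and non-negative eigenvalues as a sanity test of the PSD claim.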

Figures (3)

  • Figure 1: GP prior and posterior for PAMPA prediction using the proposed kernel. Conditioning on the data contracts the posterior uncertainty and shifts the mean toward the observed sample values.
  • Figure 2: Illustration of the PMD-GAK dynamic program with a compactly supported Toeplitz position kernel $\omega_T(i,j)=\psi(|i-j|)$ of bandwidth $T=3$. Gray cells indicate entries where $\omega_T(i,j)=0$ (hence $\kappa_T(i,j)=0$), so $M_{i,j}$ does not need to be updated. For white cells inside the band, $M_{i,j}$ is computed from its three predecessors according to $M_{i,j}=\kappa_T(i,j)M_{i-1,j-1}+M_{i-1,j}+M_{i,j-1}$ (arrows). Because $\omega_T(i,j)$ depends only on the index difference $|i-j|$ (Toeplitz structure), the nonzero entries form a diagonal band around the main diagonal.
  • Figure 3: Predicted probability distributions (outer test, canonical-group split). The x-axis shows the predicted probabilities, and the y-axis shows their estimated density on the test set. Kernel-based GP models produce score histograms that closely track the empirical distribution of PAMPA values in the dataset, whereas RF, XGBoost, and ChemBERTa yield noticeably different score profiles. This alignment is consistent with their stronger calibration (lower Brier/ECE; Tables \ref{tab:label-stratified_alignment}–\ref{tab:canonical-group-stratified_alignment}) and suggests that the monomer-aware kernels capture permeability-relevant sequence/chemical structure more effectively under canonical splitting.
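
The position-aware recursion described in the Figure 2 caption can be sketched as follows. The caption only states that $\omega_T(i,j)=\psi(|i-j|)$ is compactly supported with bandwidth $T$; the specific triangular form $\psi(d)=\max(0, 1-d/T)$, the product $\kappa_T(i,j)=\omega_T(i,j)\,k(s_i,t_j)$, the boundary condition $M_{0,0}=1$, and the function names are assumptions made for illustration. The sketch evaluates the quoted recursion on every cell and does not implement the band-skipping optimization shown in the figure.

```python
import numpy as np

def triangular_position_kernel(n: int, m: int, T: int) -> np.ndarray:
    """omega_T(i, j) = psi(|i - j|), ASSUMING psi(d) = max(0, 1 - d / T)."""
    i = np.arange(n)[:, None]
    j = np.arange(m)[None, :]
    return np.maximum(0.0, 1.0 - np.abs(i - j) / T)

def pmd_gak(S, T: int) -> float:
    """Sketch of the position-aware recursion from the Figure 2 caption.

    S is an (n, m) matrix of local kernel values S[i, j] = k(s_i, t_j).
    """
    S = np.asarray(S, dtype=float)
    n, m = S.shape
    # ASSUMED combination: position weight times local monomer similarity.
    kappa = triangular_position_kernel(n, m, T) * S
    M = np.zeros((n + 1, m + 1))
    M[0, 0] = 1.0  # ASSUMED boundary condition; the rest of the border is 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i, j] = (kappa[i - 1, j - 1] * M[i - 1, j - 1]
                       + M[i - 1, j] + M[i, j - 1])
    return float(M[n, m])

omega = triangular_position_kernel(5, 5, T=3)
print(omega[0])  # weights decay with |i - j| and vanish once |i - j| >= T
```

Because the weight depends only on $|i-j|$, the nonzero entries of `omega` form a diagonal band, which is what makes the banded evaluation in the figure possible.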

Theorems & Definitions (2)

  • Theorem 1: Positive semidefiniteness of the Decoupled GA kernel
  • Theorem 2: Positive semidefiniteness of the Position-aware Decoupled GA kernel