Machine learning approaches for interpretable antibody property prediction using structural data

Kevin Michalewicz, Mauricio Barahona, Barbara Bravi

TL;DR

The chapter tackles the problem of designing antibodies by incorporating structural information into ML models, arguing that structure-aware representations enable both accurate predictions and mechanistic insight. It presents two graph-based frameworks, ANTIPASTI and INFUSSE, that fuse structure-derived signals with sequence embeddings to predict global properties like binding affinity and local properties such as residue B-factors, while enabling interpretability through model-agnostic and model-dependent analyses. ANTIPASTI reveals long-range affinity-determining correlations across regions (e.g., CDR-H3 with FR-L2), whereas INFUSSE demonstrates that combining ProtBERT embeddings with geometry-based graphs improves per-residue predictions, especially in loops and helices. The work advocates for interpretable, structure-informed antibody design and outlines future directions toward multi-property optimization and uncertainty quantification in in silico design workflows.

Abstract

Understanding the relationship between antibody sequence, structure and function is essential for the design of antibody-based therapeutics and research tools. Recently, machine learning (ML) models mostly based on the application of large language models to sequence information have been developed to predict antibody properties. Yet there are open directions to incorporate structural information, not only to enhance prediction but also to offer insights into the underlying molecular mechanisms. This chapter provides an overview of these approaches and describes two ML frameworks that integrate structural data (via graph representations) with neural networks to predict properties of antibodies: ANTIPASTI predicts binding affinity (a global property) whereas INFUSSE predicts residue flexibility (a local property). We survey the principles underpinning these models; the ways in which they encode structural knowledge; and the strategies that can be used to extract biologically relevant statistical signals that can help discover and disentangle molecular determinants of the properties of interest.

Paper Structure

This paper contains 16 sections, 13 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Antibody sequence and structure. (A) Each variable domain of the heavy (VH) and light (VL) chains contains 3 CDRs and 4 FRs, defined via the Chothia position numbering scheme (Chothia1987). (B) The structure of an antibody is Y-shaped, with two identical antigen-binding fragments (Fab) and one crystallizable fragment (Fc). Each Fab consists of the variable (VH and VL) and first constant domains (CH1 and CL), while the Fc comprises the CH2 and CH3 domains.
  • Figure 2: Protein representations. (A) Left: In ENMs, a protein structure is modeled as a network in which the nodes (typically residues) are connected by springs that represent the elastic force acting between them. For two connected nodes $i$ and $j$, the spring constant $k_{ij}$ quantifies the elastic interaction strength. The ANM is a particular case where $k_{ij} = k(r_{ij}^*)$, with $r_{ij}^*$ denoting the equilibrium distance between nodes $i$ and $j$. Right: Graph representation of a protein structure as a network of $N$ nodes (typically residues) connected by edges $e_{ij}$ with weights $w(e_{ij})$, which define the adjacency matrix $A_{ij}=w(e_{ij})$. The diffusive dynamics and message-passing behavior of this system are governed by the associated graph Laplacian $\mathbf{L}$. (B) Structure embeddings (data-driven and derived by LLMs) capture both per-residue (state) and residue-residue (pair) properties of the protein structure. (C) A protein sequence $\mathbf{x}_\mathrm{s}$ can be represented using one-hot encoding, physicochemical feature sets, or context-rich embeddings learned by an LLM. (D) Dimensionality reduction applied to the representations in C facilitates inspection through visualization in a lower-dimensional space. One-hot encodings do not capture similarity in physicochemical properties (denoted by colors) between residues. The $F_\text{phys}$ features quantifying physicochemical properties exhibit consistent groupings. The LLM-learned embeddings capture both biochemical similarity and context-dependence, so that the representation of a given residue is influenced by its physicochemical properties and by the sequence to which it belongs (context).
  • Figure 3: Overview of ANTIPASTI. (A) From the PDB structure of the antibody-antigen complex, an elastic network model representation is created and a map of correlations between residues is calculated from normal modes, which carry the antigen's imprint on the antibody residues. Gaps are then added in place of absent residues through an alignment, and the antigen residues are removed from the correlation map to produce the input image for ANTIPASTI's CNN architecture (see C). ANTIPASTI processes this input image and yields a binding affinity prediction and a map of affinity-relevant correlations. (B) ANTIPASTI predictions on the test set are compared to the ground-truth affinities for five training/test splits. (C) ML architecture of ANTIPASTI. Figure adapted from Ref. Michalewicz2024.
  • Figure 4: Overview of INFUSSE. (A) In the INFUSSE architecture, a frozen LLM (ProtBERT) and a diffusive Graph Convolutional Network (diff-GCN) are combined to predict B-factors for antibody-antigen complexes. In the sequence block ($S_{\mathrm{block}}$), input sequences are encoded by the frozen ProtBERT model and passed through a learnable non-linear layer $T_2$, then summed with their one-hot encoded version transformed by another learnable non-linear layer $T_1$, producing enriched sequence embeddings $\mathbf{X}$ that are further transformed by a learnable non-linear layer $T_3$. The resulting representations are summed with the output of the graph block $G_{\mathrm{block}}$, i.e., the diff-GCN with learnable parameters $t$, $\mathbf{W}^{(0)}$ and $\mathbf{W}^{(1)}$, which takes $\mathbf{X}$ from $S_{\mathrm{block}}$ as input node features and leverages the Laplacian of a geometric graph constructed from the antibody-antigen structure. (B) Prediction errors of INFUSSE, $\varepsilon_{\text{INFUSSE}, n}^{(q)}$, and of $S_{\mathrm{block}}$ alone, $\varepsilon_{S_\mathrm{block}, n}^{(q)}$, for each position $n$ of the antibody heavy-chain variable region, averaged over the samples $q \in \mathcal{Q}$. Figure adapted from Ref. Michalewicz2025.
  • Figure 5: Biophysical interpretation in ML models. (A) Ranking of ANTIPASTI importance factors for different antibody regions for protein targets. The importance factor (equation \ref{eq:importance_factor}) is expressed as a percentage of that of the best region, and we show the average (dot) and extreme values (error bars) over 5 training/test splits. (B) Boxplot of INFUSSE's $\Delta_{\mathrm{graph}}$ stratified by secondary structure types ($\alpha$-helix, $\beta$-strand or loop) for the antigens in the test set. (C) ANTIPASTI's affinity-relevant correlations for an antibody with a protein target (Mntc with Mab 305-78-7 complex, PDB entry: 5hdq). (D) Distribution of $\alpha$-carbon pairwise distances involved in ANTIPASTI's top $10$ affinity-relevant correlations across all antibody structures. Panel B adapted from Ref. Michalewicz2025; panels A, C and D adapted from Ref. Michalewicz2024.
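
The graph representation described in the caption of Figure 2A (right) can be illustrated with a minimal sketch. It builds a weighted adjacency matrix $A_{ij}=w(e_{ij})$ from residue coordinates, forms a graph Laplacian $\mathbf{L}$, and diffuses a node feature over the graph. This is not the authors' implementation: the 8 Å distance cutoff, the inverse-distance weights, the combinatorial Laplacian $\mathbf{L}=\mathbf{D}-\mathbf{A}$, the explicit Euler integration, the toy coordinates and all function names are assumptions chosen for illustration only.

```python
import math

def build_adjacency(coords, cutoff=8.0):
    """Weighted adjacency A_ij = w(e_ij) for node pairs closer than `cutoff`.

    The cutoff (in angstroms) and the inverse-distance weight are assumptions.
    """
    n = len(coords)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            if d < cutoff:
                # Assumed weighting: closer residues interact more strongly.
                A[i][j] = A[j][i] = 1.0 / d
    return A

def graph_laplacian(A):
    """Combinatorial Laplacian L = D - A, with D the diagonal of weighted degrees."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def diffuse(L, x, t=0.5, steps=200):
    """Approximate x(t) = exp(-t L) x by explicit Euler steps of dx/dt = -L x."""
    dt = t / steps
    x = list(x)
    for _ in range(steps):
        Lx = [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]
        x = [x_i - dt * lx_i for x_i, lx_i in zip(x, Lx)]
    return x

# Toy alpha-carbon coordinates in angstroms (made up, not taken from a PDB entry).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
A = build_adjacency(coords)
L = graph_laplacian(A)
x0 = [1.0, 0.0, 0.0, 0.0]  # scalar node feature concentrated on the first residue
xt = diffuse(L, x0)
```

Because each row of this Laplacian sums to zero and the matrix is symmetric, diffusion redistributes the feature across neighbouring residues while conserving its total. In a learned model, the diffusion time would typically be a trainable parameter rather than the fixed value used here.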