IgPose: A Generative Data-Augmented Pipeline for Robust Immunoglobulin-Antigen Binding Prediction

Tien-Cuong Bui, Injae Chung, Wonjun Lee, Junsu Ko, Juyong Lee

Abstract

Predicting immunoglobulin-antigen (Ig-Ag) binding remains a significant challenge due to the paucity of experimentally resolved complexes and the limited accuracy of de novo Ig structure prediction. We introduce IgPose, a generalizable framework for Ig-Ag pose identification and scoring, built on a generative data-augmentation pipeline. To mitigate data scarcity, we constructed the Structural Immunoglobulin Decoy Database (SIDD), a comprehensive repository of high-fidelity synthetic decoys. IgPose integrates equivariant graph neural networks, ESM-2 embeddings, and gated recurrent units to capture both geometric and evolutionary features. We implemented interface-focused k-hop sampling with biologically guided pooling to enhance generalization across diverse interfaces. The framework comprises two sub-networks, IgPoseClassifier for binding pose discrimination and IgPoseScore for DockQ score estimation, and achieves robust performance on curated internal test sets and the CASP-16 benchmark compared to physics-based and deep learning baselines. By providing accurate pose filtering and ranking, IgPose serves as a versatile computational tool for high-throughput antibody discovery pipelines. IgPose is available on GitHub (https://github.com/arontier/igpose).

Paper Structure (38 sections, 20 equations, 12 figures, 4 tables, 2 algorithms)

This paper contains 38 sections, 20 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: An overview of the IgPose model architecture and its internal components. (a) The data preparation pipeline, from data collection, data cleaning, and decoy generation with Chai-1 [chaidiscovery2024] and Boltz-2 [wohlwend2024boltz1, passaro2025boltz] to data preparation and model training and evaluation. (b) The IgPose equivariant message-passing architecture, built upon EGNN [satorras2021n] and a customized GRU module. (c) The architecture of the customized GRU module. (d) Alternative global pooling operators corresponding to different selective regions.
  • Figure 2: Model performance comparison across (a) classification and (b) regression tasks on our internal SID test datasets and CASP-16. Results of our three implemented models (IgPoseClassifier, FastEGNN, and MACE) are averaged over five executions and colored from pale to deep blue. Results of all existing methods are colored from pale to deep green. In the IgPC-AbET and IgPS-AbES settings, we take a weighted average of the output probabilities/scores of IgPoseClassifier/IgPoseScore and AbEpiTarget/AbEpiScore, with weights of 0.7 and 0.3, respectively. FS and FT refer to two IgPoseScore variants: training from scratch and fine-tuning from an IgPoseClassifier checkpoint. Error bars for our models represent the standard deviation over five executions. Please refer to \ref{tab:detailed_classification} for a detailed comparison of methods on five metrics. (Best viewed in color.)
  • Figure 3: Representative decoy structures from our SIDD test dataset and CASP-16 competition submissions. The decoy structures, shown as cartoons, are categorized into four groups based on the IgPoseClassifier prediction: true positive, true negative, false positive, and false negative. Ground-truth structures are shown as transparent surfaces. For SIDD dataset structures, the prefixes 'Chai1' and 'Boltz2' specify the computational method used for decoy generation, followed by letters that reference the original PDB-ID. For CASP-16 structures [CASP16], 'H12[number]' denotes the CASP-16 target ID and 'TS[number]' corresponds to the participating group that submitted the prediction. PDB-IDs of ground-truth structures are shown in parentheses. Each structure is accompanied by four numerical scores (from left to right): global TMscore, global DockQ score, IgPoseClassifier score, and IgPoseScore. IgPoseScore was computed for true/false positive structures only. An asterisk (*) indicates DockQ scores provided by CASP-16 assessors; a dagger ($\dagger$) indicates incomputable DockQ scores, which apply exclusively to non-cognate Ig-Ag decoys. The color-coding scheme used to distinguish antibodies (Ab), nanobodies (Nb), antigens (Ag), T-cell receptors (TCR), peptide-MHC complexes (pMHC), and single-chain variable fragments (scFv) is indicated at the bottom of the figure.
  • Figure 4: Comparison of Top-K success rates for IgPose and AbEpiTope on two benchmark datasets. Blue-shaded and green-shaded bars show the SID-R and CASP-16 success rates, respectively. Success rate is defined as the precision in Top-K: the proportion of true positive samples among the Top-K ranked predictions. IgPoseClassifier or AbEpiTarget is first used to filter out predicted negative samples; the remaining predicted positives are then ranked by IgPoseScore or AbEpiScore, and precision is calculated among the top 10, 20, 50, and 100 ranked candidates. IgPose+AbEpiTope denotes the weighted ensemble variant of the two methods.
  • Figure S1: Performance comparison of pooling strategies on the SID-CA test set and the CASP-16 benchmark. Labels denote the set of nodes used in the global weighted sum pooling operation. 'w/o' denotes exclusion of the specified node set from pooling, while the suffix 'only' indicates exclusive use of the specified set of nodes. All nodes: weighted sum pooling over all nodes in a graph. Ensemble of 3 best: average of output probabilities from the three models corresponding to the pale blue bars. Further details can be found in \ref{sec:pooling_strategy}.
  • ...and 7 more figures
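
The weighted ensemble in the Figure 2 caption and the filter-then-rank Top-K precision in the Figure 4 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 0.5 decision threshold, the toy numbers, and the choice to divide by the number of candidates actually returned (when fewer than K poses pass the filter) are all assumptions made for illustration. Only the 0.7/0.3 weights and the filter, rank, then precision order come from the captions.

```python
from typing import List, Sequence


def weighted_ensemble(primary: Sequence[float], secondary: Sequence[float],
                      w_primary: float = 0.7, w_secondary: float = 0.3) -> List[float]:
    """Per-pose weighted average of two models' outputs (weights from the Figure 2 caption)."""
    if len(primary) != len(secondary):
        raise ValueError("score lists must have the same length")
    return [w_primary * p + w_secondary * s for p, s in zip(primary, secondary)]


def top_k_precision(class_probs: Sequence[float], rank_scores: Sequence[float],
                    labels: Sequence[int], k: int, threshold: float = 0.5) -> float:
    """Filter poses by classifier probability, rank survivors by score,
    and return the fraction of true positives among the top k.

    The threshold of 0.5 is an assumed value, not one stated in the source.
    """
    # Step 1: keep only poses the classifier predicts as positive.
    kept = [(score, label)
            for prob, score, label in zip(class_probs, rank_scores, labels)
            if prob >= threshold]
    # Step 2: rank the remaining poses by the regression score, best first.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    top = kept[:k]
    if not top:
        return 0.0
    # Step 3: precision among the returned candidates (assumed denominator).
    return sum(label for _, label in top) / len(top)


# Toy data: five hypothetical poses, label 1 = acceptable pose, 0 = incorrect pose.
probs_model_a = [0.9, 0.8, 0.2, 0.6, 0.4]   # stand-in for one classifier's probabilities
probs_model_b = [0.7, 0.1, 0.3, 0.9, 0.8]   # stand-in for a second classifier
rank_scores = [0.8, 0.3, 0.9, 0.5, 0.1]     # stand-in for predicted DockQ-like scores
labels = [1, 0, 1, 1, 0]

ensemble_probs = weighted_ensemble(probs_model_a, probs_model_b)
precision_at_2 = top_k_precision(ensemble_probs, rank_scores, labels, k=2)
precision_at_3 = top_k_precision(ensemble_probs, rank_scores, labels, k=3)
```

In this toy example the third pose has the highest ranking score but is removed by the classifier filter before ranking, which shows why the two stages can produce a different Top-K list than ranking by score alone.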