AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design

Xinyan Zhao, Yi-Ching Tang, Akshita Singh, Victor J Cantu, KwanHo An, Junseok Lee, Adam E Stogsdill, Ibraheem M Hamdi, Ashwin Kumar Ramesh, Zhiqiang An, Xiaoqian Jiang, Yejin Kim

TL;DR

AbBiBench addresses the need for a biology-grounded benchmark for antibody design by evaluating antibody–antigen (Ab–Ag) complexes rather than antibodies in isolation. The authors curate 14 datasets with more than 184,500 antibody mutants across 9 antigens and benchmark 15 models spanning masked language models, autoregressive protein language models, inverse folding models, diffusion models, and geometry-based approaches on two tasks: zero-shot affinity prediction and design generation. They find that structure-conditioned inverse folding models yield the strongest correlation between model likelihood and experimental binding and are the most effective at generating high-affinity variants, as validated with in vitro ELISA in a case study targeting influenza H1N1. The work provides a data-leakage-free framework that aligns model evaluation with the biophysics of binding and aims to accelerate function-aware antibody design.

Abstract

We introduce AbBiBench (Antibody Binding Benchmarking), a benchmarking framework for antibody binding affinity maturation and design. Unlike previous strategies that evaluate antibodies in isolation, typically by comparing them to natural sequences with metrics such as amino acid recovery rate or structural RMSD, AbBiBench instead treats the antibody-antigen (Ab-Ag) complex as the fundamental unit. It evaluates an antibody design's binding potential by measuring how well a protein model scores the full Ab-Ag complex. We first curate, standardize, and share more than 184,500 experimental measurements of antibody mutants across 14 antibodies and 9 antigens (including influenza, lysozyme, HER2, VEGF, integrin, Ang2, and SARS-CoV-2), covering both heavy-chain and light-chain mutations. Using these datasets, we systematically compare 15 protein models, including masked language models, autoregressive language models, inverse folding models, diffusion-based generative models, and geometric graph models, by comparing the correlation between model likelihood and experimental affinity values. Additionally, to demonstrate AbBiBench's generative utility, we apply it to antibody F045-092 in order to introduce binding to influenza H1N1. We sample new antibody variants with the top-performing models, rank them by the structural integrity and biophysical properties of the Ab-Ag complex, and assess them with in vitro ELISA binding assays. Our findings show that structure-conditioned inverse folding models outperform others in both affinity correlation and generation tasks. Overall, AbBiBench provides a unified, biologically grounded evaluation framework to facilitate the development of more effective, function-aware antibody design models.
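The affinity-prediction evaluation described in the abstract compares model likelihoods with experimental affinities by rank correlation. The following is a minimal sketch of that idea, not the benchmark's own code: it computes Spearman's rank correlation in plain Python. The function names (`average_ranks`, `pearson`, `spearman`) and the toy numbers are illustrative assumptions; in practice the scores would come from a protein model and the affinities from one of the curated datasets.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one model score and one measured affinity per mutant.
log_likelihoods = [-12.1, -10.4, -15.3, -9.8, -11.0]
binding_affinities = [7.9, 8.6, 6.2, 8.1, 8.4]  # higher = tighter binding

rho = spearman(log_likelihoods, binding_affinities)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based statistic is a natural fit here because model log-likelihoods and assay readouts live on different scales; only the ordering of mutants needs to agree.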

Paper Structure

This paper contains 15 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1:
  • Figure 2:
  • Figure 4: Overview of AbBiBench benchmarks. Antibody variants with experimentally determined affinity values are curated. Data modalities include amino acid sequences, wild-type antibody-antigen complexes, and affinity scores. A diverse set of baseline models includes general protein language models and specialized antibody models. All models are evaluated on two tasks: affinity prediction and antibody redesign. Five computational metrics assess newly designed antibodies from sequence plausibility, structural integrity, and binding affinity perspectives.
  • Figure 5: Spearman’s rank correlation coefficients between model log likelihood from various protein models and experimental binding affinities across multiple datasets. Models marked with * are structure-informed.
  • Figure 6: Proportion of top-10 ranked antibody designs achieving $\geq$5-fold affinity improvement across models and datasets. Only datasets reporting affinity as $-\!\log K_d$ were used. Datasets based on enrichment scores were excluded, as enrichment reflects relative sequence abundance and cannot determine fold change. Models marked with * are structure-informed.
  • ...and 2 more figures
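
The Figure 6 caption describes a hit-rate metric: the proportion of top-10 ranked designs with at least a 5-fold affinity improvement, computed only where affinity is reported as $-\log K_d$. A rough sketch of how such a metric could be computed is shown below. It rests on several assumptions that the caption does not confirm: the logarithm is taken as base 10, designs are ranked by a higher-is-better model score, and improvement is measured against the wild-type antibody. The function name `top_k_hit_rate` and the example values are hypothetical.

```python
import math


def top_k_hit_rate(model_scores, neg_log10_kd, wild_type_neg_log10_kd,
                   k=10, fold=5.0):
    """Fraction of the k best-scored designs whose K_d improved >= `fold`-fold.

    Assumes affinities are -log10(K_d), so a `fold`-fold lower K_d shows up
    as an increase of log10(fold) over the wild-type value.
    """
    if len(model_scores) != len(neg_log10_kd):
        raise ValueError("scores and affinities must have the same length")
    if not model_scores:
        raise ValueError("at least one design is required")
    threshold = wild_type_neg_log10_kd + math.log10(fold)
    # Rank designs by model score, best first (assumed: higher is better).
    ranked = sorted(range(len(model_scores)),
                    key=lambda i: model_scores[i], reverse=True)
    top = ranked[:k]
    hits = sum(1 for i in top if neg_log10_kd[i] >= threshold)
    return hits / len(top)


# Hypothetical example: five designs, wild-type -log10(K_d) of 8.0.
scores = [5.0, 4.0, 3.0, 2.0, 1.0]
affinities = [9.0, 8.1, 8.8, 9.5, 8.0]
rate = top_k_hit_rate(scores, affinities, wild_type_neg_log10_kd=8.0, k=3)
print(f"top-3 hit rate = {rate:.2f}")
```

Because a fold change in $K_d$ becomes an additive shift on the log scale, the threshold is a simple offset of $\log_{10} 5 \approx 0.70$. Enrichment scores have no such interpretation, which is consistent with the caption's exclusion of enrichment-based datasets.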