InstructPro: Natural Language Guided Ligand-Binding Protein Design

Zhenqiao Song, Ramith Hettiarachchi, Chuan Li, Jianwen Xie, Lei Li

Abstract

The de novo design of ligand-binding proteins with tailored functions is essential for advancing biotechnology and molecular medicine, yet existing AI approaches are limited by scarce protein-ligand complex data. To circumvent this data bottleneck, we leverage the abundant natural language descriptions characterizing protein-ligand interactions. Here, we introduce InstructPro, a family of generative models that design proteins following the guidance of natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified function descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants -- InstructPro-1B and InstructPro-3B -- that substantially outperform strong baselines. InstructPro-1B achieves an AlphaFold3 ipTM of 0.918 and a binding affinity of -8.764 on seen ligands, while maintaining robust performance in a zero-shot setting with scores of 0.869 and -6.713, respectively. These results are accompanied by novelty scores of 70.1% and 68.8%, underscoring the model's ability to generalize beyond the training set. Furthermore, the model yields a superior binding free energy of -20.9 kcal/mol and an average of 5.82 intermolecular hydrogen bonds, validating its proficiency in designing high-affinity ligand-binding proteins. Notably, scaling to InstructPro-3B further improves the zero-shot ipTM to 0.882, binding affinity to -6.797, and binding free energy to -25.8 kcal/mol, demonstrating clear performance gains associated with increased model capacity. These findings highlight the power of natural language-guided generative models to mitigate the data bottlenecks in traditional structure-based methods, significantly broadening the scope of de novo protein design.

Paper Structure

This paper contains 5 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The overall architecture of InstructPro. The text encoder processes the human instruction and the protein function description, both given in natural language. The shared memory module extracts the essential contextual semantics from the function description. The ligand encoder encodes molecular representations from the ligand SMILES formula, capturing the chemical context of the target ligand. Conditioned on both the extracted contextual semantics and the ligand representations, the protein decoder generates a protein sequence that aligns with the function specification and can bind to the target ligand.
  • Figure 2: Evaluation of InstructPro. a, The distribution of RMSD between the folded structures of designed proteins and the ground-truth proteins. b, The impact of removing (w/o) the text or ligand encoder. c, The effect of initializing the text encoder or ligand encoder from pretrained weights. d, The influence of removing the shared memory module. e, The impact of memory eliciting vector size. f, The diversity scores of Pinal and InstructPro. g, The binding affinity distributions of Pinal and InstructPro when a sampling strategy is applied.
  • Figure 3: Ligand-binding proteins designed by InstructPro-1B. In all examples, natural proteins are depicted in grey, while the proteins designed by InstructPro-1B are shown in green or blue. The hydrogen bonds between the designed protein and the target ligand are shown in purple.
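
The data flow described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the captions do not specify how the shared memory module works, so the sketch assumes it is a single-head cross-attention in which a fixed set of query vectors (a stand-in for the "memory eliciting vectors" mentioned in Figure 2e) attends over the text encoder output. The function names, the hidden size, the sequence lengths, and the choice to concatenate memory and ligand representations are all hypothetical, and random arrays stand in for the real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical hidden size


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ keys_values


def build_decoder_context(text_states, ligand_states, memory_queries):
    """Assumed conditioning path: text -> memory, then memory + ligand."""
    # Assumption: the memory module compresses the function description
    # into a fixed number of vectors, one per memory query.
    memory = cross_attention(memory_queries, text_states)    # (M, D)
    # Assumption: the decoder is conditioned on the concatenation of the
    # memory vectors and the ligand token representations.
    return np.concatenate([memory, ligand_states], axis=0)   # (M + L, D)


# Random stand-ins for encoder outputs (not real model activations).
text_states = rng.normal(size=(40, D))    # 40 text tokens
ligand_states = rng.normal(size=(12, D))  # 12 ligand SMILES tokens
memory_queries = rng.normal(size=(4, D))  # 4 memory eliciting vectors

context = build_decoder_context(text_states, ligand_states, memory_queries)
print(context.shape)  # (16, 8)
```

Under these assumptions, the context handed to the protein decoder has a length that depends on the number of memory vectors and ligand tokens but not on the length of the function description, which is one plausible reason to place a memory module between the text encoder and the decoder.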