TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation
Lin Zongying, Li Hao, Lv Liuzhenghao, Lin Bin, Zhang Junwu, Chen Calvin Yu-Chian, Yuan Li, Tian Yonghong
TL;DR
TaxDiff addresses the lack of controllable protein sequence generation with a taxonomic-guided diffusion framework: the denoising transformer is conditioned on a taxonomic identifier (tax-id), and a patchify attention mechanism captures both global and local sequence features. By reclassifying UniProt taxonomy to the family and species levels and injecting tax-id conditioning into every denoising step, TaxDiff achieves state-of-the-art performance on both unconditional and taxonomic-guided generation, while sampling in about one quarter of the time required by competing diffusion-based methods. The combination of the Denoise Transformer with global and local attention and adaLN (adaptive layer normalization) conditioning yields sequences that are structurally coherent and highly consistent with target folds, as evidenced by high pLDDT, high TM-score, and low RMSD on the AFDB and PDB benchmarks. In practice, the approach reduces screening time in design tasks and enables species-specific protein generation, with potential extensions to protein complexes and broader design applications.
Abstract
Designing protein sequences with specific biological functions and structural stability is crucial in biology and chemistry. Generative models have already demonstrated their capability for reliable protein design. However, previous models are limited to unconditional generation of protein sequences and lack the controllable generation ability that is vital to biological tasks. In this work, we propose TaxDiff, a taxonomic-guided diffusion model for controllable protein sequence generation that combines biological species information with the generative capabilities of diffusion models to generate structurally stable proteins within the sequence space. Specifically, taxonomic control information is inserted into each layer of the transformer block to achieve fine-grained control. The combination of global and local attention ensures the sequence consistency and structural foldability of taxonomic-specific proteins. Extensive experiments demonstrate that TaxDiff consistently achieves better performance on multiple protein sequence generation benchmarks in both taxonomic-guided controllable generation and unconditional generation. Remarkably, the sequences generated by TaxDiff even surpass those produced by direct-structure-generation models in predicted-structure confidence, and require only a quarter of the sampling time of other diffusion-based models. The code for generating proteins and training new versions of TaxDiff is available at: https://github.com/Linzy19/TaxDiff.
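To make the idea of inserting taxonomic control information into each transformer layer concrete, the following is a minimal sketch of adaptive layer normalization (adaLN) conditioning, assuming the common formulation in which a condition yields a per-layer shift and scale applied after normalization. It is not taken from the TaxDiff repository: the function names (layer_norm, make_tax_table, ada_ln, conditioned_stack), the lookup-table parameterization, and all sizes are hypothetical, and the attention and feed-forward sublayers of a real block are omitted.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def make_tax_table(num_tax_ids, dim, seed=0):
    """Hypothetical stand-in for learned parameters: one (shift, scale) row per tax-id.

    A trained model would typically regress these values from a tax-id
    embedding; here they are small random numbers for illustration only.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(2 * dim)]
            for _ in range(num_tax_ids)]


def ada_ln(x, tax_id, tax_table):
    """Apply layer norm, then modulate with the tax-id's shift and scale."""
    dim = len(x)
    params = tax_table[tax_id]
    shift, scale = params[:dim], params[dim:]
    normed = layer_norm(x)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


def conditioned_stack(x, tax_id, layer_tables):
    """Inject the same tax-id into every layer, each with its own table."""
    for table in layer_tables:
        x = ada_ln(x, tax_id, table)
    return x


# Usage: three layers, four hypothetical tax-ids, 8-dimensional token features.
layer_tables = [make_tax_table(num_tax_ids=4, dim=8, seed=i) for i in range(3)]
token = [float(i) for i in range(8)]
out_a = conditioned_stack(token, 0, layer_tables)
out_b = conditioned_stack(token, 1, layer_tables)
```

In this sketch the same input token produces different features under different tax-ids because every layer applies its own tax-id-dependent shift and scale, which is one plausible reading of "fine-grained control" at each layer.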
