Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

Jiale Zhao, Wanru Zhuang, Jia Song, Yaqi Li, Shuqi Lu

TL;DR

This work tackles the limitation of residue-only pre-training in 3D protein modeling by introducing Span Mask Protein Chain (SMPC) to prevent information leakage that otherwise trivializes residue tasks when atoms are included. It proposes Vabs-Net, a Vector Aware Bilevel Sparse Attention network that jointly models residues and atoms via two interacting sparse attention tracks and edge-direction encodings, enabling rich all-atom and residue representations. Through SMPC-driven pre-training and diverse downstream evaluations (EC/GO function prediction, binding-site prediction, and molecular docking), the approach achieves state-of-the-art results, including improved docking performance, demonstrating strong transfer of structure-aware priors. The method advances protein representation learning by effectively integrating multi-level structural information with scalable, task-agnostic pre-training, offering practical benefits for structure-based drug design and functional annotation.

Abstract

In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models focus primarily on the residue level, i.e., alpha carbon atoms, while ignoring other atoms such as side chain atoms. We argue that modeling proteins at both the residue and atom levels is important, since side chain atoms can be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify a key reason: including atom structure in the input causes information leakage, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representations suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate that our proposed pre-training approach significantly outperforms other methods. Our code will be made public.

Paper Structure (19 sections, 13 equations, 2 figures, 11 tables)

Figures (2)

  • Figure 1: Once all atoms are added to the input, the possible range of each residue position is constrained, which makes residue positions, angles between edges, and similar targets easier to predict. When all atoms are used, the prediction of residue positions and inter-edge angles relies predominantly on the other atoms rather than on the residues themselves.
  • Figure 2: An overview of the Vabs-Net architecture. Atom nodes are encoded from atom type, residue type, and preprocessed ESM features. Each residue node shares its representation with the corresponding alpha carbon. Edges are encoded by a vector edge encoder and a distance edge encoder, which capture edge direction and edge distance. The node and edge encodings are fed into a two-track sparse attention module, where each track consists of a sparse attention module and a feedforward neural network. The module first updates atom representations with the atom-atom track and then updates the alpha carbon nodes with the residue-residue track, so the two tracks interact through the alpha carbon nodes. Finally, the node and edge representations are used for the pre-training and downstream tasks. The span mask protein chain strategy is shown on the left: during pre-training, atom nodes other than alpha carbons are removed in the masked area.
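
The masking step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the `Atom` record, the function names `sample_span_mask` and `apply_span_mask`, and the `mask_ratio` and `span_len` values are all hypothetical, since this summary does not state how spans are sampled. The only behavior taken from the source is that, inside masked spans, every atom node except the alpha carbon is removed.

```python
import random
from dataclasses import dataclass


@dataclass
class Atom:
    residue_index: int  # index of the residue this atom belongs to
    name: str           # PDB-style atom name, e.g. "CA" for the alpha carbon


def sample_span_mask(num_residues, mask_ratio=0.3, span_len=5, rng=None):
    """Pick contiguous residue spans until roughly mask_ratio is covered.

    mask_ratio and span_len are illustrative assumptions, not values
    from the paper. The final span may overshoot the target slightly.
    """
    rng = rng or random.Random()
    target = min(num_residues, int(round(mask_ratio * num_residues)))
    masked = set()
    while len(masked) < target:
        start = rng.randrange(num_residues)
        # Mask a contiguous span, clipped at the end of the chain.
        masked.update(range(start, min(start + span_len, num_residues)))
    return sorted(masked)


def apply_span_mask(atoms, masked_residues):
    """Drop every non-alpha-carbon atom that lies in a masked residue."""
    masked = set(masked_residues)
    return [a for a in atoms if a.residue_index not in masked or a.name == "CA"]


if __name__ == "__main__":
    # Toy chain: 20 residues, each with backbone atoms plus a beta carbon.
    chain = [Atom(i, n) for i in range(20) for n in ("N", "CA", "C", "O", "CB")]
    masked = sample_span_mask(20, rng=random.Random(0))
    kept = apply_span_mask(chain, masked)
    print("masked residues:", masked)
    print("atoms before/after:", len(chain), len(kept))
```

Masking contiguous spans rather than isolated residues means the neighbors of a masked residue are usually masked too, so their side chain atoms are also absent from the input. That is the sense in which the strategy limits the leakage shown in Figure 1.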