Boosting Protein Language Models with Negative Sample Mining

Yaoyao Xu, Xinjian Zhao, Xiaozhuang Song, Benyou Wang, Tianshu Yu

TL;DR

This work tackles the over-emphasis on co-evolution signals in protein language models by introducing NM-Transformer, a negative sample mining framework that fine-tunes PLMs to de-emphasize irrelevant alignment in cross-attention. Negative samples are generated for both protein-wise and protein-pair tasks, and a cross-attention-based loss drives the model toward a more uniform alignment with negatives, complementing the supervised objective. Across five datasets and multiple PLMs, NM-Transformer yields consistent performance gains, particularly helping smaller PLMs close the gap with larger models, and enables interpretable attention that highlights residues near binding interfaces. The approach offers a practical, interpretable enhancement to protein representation learning with potential broad impact on protein function prediction and interaction analysis.
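The two ingredients named above, negative sampling and a cross-attention loss that pushes alignment with negatives toward uniform, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not state the exact loss, so the KL divergence to a uniform distribution, the scaled dot-product attention, the `weight` coefficient, and all function names here are assumptions.

```python
import math
import random


def sample_negatives(anchor_label, pool, k, rng):
    """Pick up to k sequences whose label differs from the anchor's label.

    `pool` is a list of (sequence, label) pairs. Treating "different label"
    as "negative" is an assumption for protein-wise tasks.
    """
    candidates = [seq for seq, label in pool if label != anchor_label]
    return rng.sample(candidates, min(k, len(candidates)))


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """Row-wise softmax of scaled dot products (anchor residues x negative residues)."""
    d = len(queries[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(r) for r in scores]


def uniformity_loss(attn):
    """Mean KL(row || uniform) over rows; zero only when every row is uniform."""
    n_cols = len(attn[0])
    total = 0.0
    for row in attn:
        total += sum(p * math.log(p * n_cols) for p in row if p > 0.0)
    return total / len(attn)


def total_loss(supervised, attn_with_negative, weight=0.1):
    """Supervised task loss plus the (assumed) weighted uniformity term."""
    return supervised + weight * uniformity_loss(attn_with_negative)


# Toy usage: 2 anchor residues attending over 3 residues of a negative sample.
anchor = [[1.0, 0.0], [0.0, 1.0]]
negative = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = cross_attention(anchor, negative)
print(total_loss(supervised=0.7, attn_with_negative=attn))
```

Minimizing the uniformity term flattens the attention that an input places on a negative sample, which is one way to read "de-emphasize irrelevant alignment"; in a real setup the same computation would be done on PLM embeddings with an autodiff framework.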

Abstract

We introduce a methodology for boosting large language models in the domain of protein representation learning. Our primary contribution is a refinement process that corrects the over-reliance on co-evolution knowledge: networks are trained to distill insights from negative samples, which are protein pairs drawn from disparate categories. Using these negatives, our technique steers the training of transformer-based models within the attention score space. This strategy not only improves performance but also reflects the biological behaviors exhibited by proteins, offering evidence aligned with established biological mechanisms such as protein-protein interaction. We experimentally observed improved performance on various tasks and datasets, on top of several well-established large protein models. This paradigm opens up promising directions for further progress in protein research and computational biology.

Paper Structure

This paper contains 11 sections, 9 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: NM-Transformer framework. Our framework is designed for protein-wise and protein-pair tasks, taking single protein sequences or pairs of protein sequences as inputs. It consists of two main steps: 1) Negative sampling: for protein-wise tasks, we sample proteins with differing properties as negative examples based on task-predicted properties (e.g., solubility); for protein-pair tasks, we sample non-interacting proteins as negative examples. 2) Negative sample mining: we optimize the cross-attention matrix to align with uniform distributions of input and negative sample sequences, guide PLMs to learn discriminative embeddings, and generate representations for downstream tasks using self-attention layers.
  • Figure 2: The histogram on the left shows the performance improvement relative to an MLP when training from scratch and when fine-tuning, for both NM-Transformer and Transformer. The radar chart on the right shows performance as a function of the number of negative samples: as the number of negative samples increases, the performance of NM-Transformer continues to improve. All experiments were run on the Sub dataset using ESM-2 (8M).
  • Figure 3: Subfigures (a) and (b) show the cross-attention matrices produced by NM-Transformer and Transformer, highlighting clear differences between positive and negative pairs in our approach's matrix that are absent in the Transformer model's matrix. The regions surpassing the average attention score threshold are marked in the deepest shade of blue.
  • Figure 4: Protein-protein interaction complexes of the Human Tissue Factor (PDB ID: 1AHW) and Human Fanconi anemia-associated protein (PDB ID: 2MUR). Results from NM-Transformer and Transformer are shown on the left and right, respectively. The top 2 scoring residues in the cross-attention matrix for Chain-A and Chain-B are colored in red and blue, respectively. The surfaces of Chain-A and Chain-B are highlighted in pink and light blue. The reported score is the response score of the top 2 residues.
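
The attention read-outs described in Figures 3 and 4 (marking entries above the average attention score, and picking the top 2 residues per chain) can be approximated with the sketch below. The captions do not say how a per-residue score is derived from the matrix, so taking the maximum over each row (Chain-A) and each column (Chain-B) is an assumption, as are the function names and the toy matrix.

```python
def above_average_mask(attn):
    """Mark entries whose attention score exceeds the matrix-wide mean."""
    flat = [v for row in attn for v in row]
    mean = sum(flat) / len(flat)
    return [[v > mean for v in row] for row in attn]


def top_k_residues(attn, k=2):
    """Return (Chain-A indices, Chain-B indices) of the k highest-scoring residues.

    Rows are assumed to index Chain-A residues and columns Chain-B residues.
    A residue's score is assumed to be its maximum attention entry.
    """
    row_scores = [max(row) for row in attn]
    col_scores = [max(col) for col in zip(*attn)]

    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return order[:k]

    return top(row_scores), top(col_scores)


# Toy 3x3 cross-attention matrix (hypothetical values).
attn = [
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
    [0.8, 0.1, 0.1],
]
mask = above_average_mask(attn)
chain_a, chain_b = top_k_residues(attn, k=2)
print(mask)
print(chain_a, chain_b)
```

The selected indices could then be mapped back to residue positions in the structure to check whether they fall near the binding interface.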