Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?

Yang Tan; Lirong Zheng; Bozitao Zhong; Liang Hong; Bingxin Zhou

TL;DR

The paper questions the blanket benefit of incorporating amino acid sequence information into protein representations for structure-related tasks. It introduces ProtLOCA, a roto-equivariant GVP-based framework that encodes local geometric structure without relying on amino acid types, and validates it on global and local structure alignment tasks. On independent CATH-based benchmarks, ProtLOCA achieves state-of-the-art global structure matching and demonstrates the ability to identify common local structural motifs across proteins with different overall folds, including cases where sequence-based methods falter. The work suggests a shift toward structure-centric representations for function inference and highlights the importance of focusing on local geometries when structure alignment is the primary objective.

Abstract

Deep learning has become a crucial tool in studying proteins. While the significance of modeling protein structure has been discussed extensively in the literature, amino acid types are typically included in the input by default for many inference tasks. Using a structure alignment task, this study demonstrates that embedding amino acid types does not always help a deep learning model learn a better representation. To this end, we propose ProtLOCA, a local geometry alignment method based solely on a structural representation of amino acids. The effectiveness of ProtLOCA is examined on a global structure-matching task over protein pairs, using an independent test dataset based on CATH labels. Our method outperforms existing sequence- and structure-based representation learning methods by matching structurally consistent protein domains more quickly and accurately. Furthermore, in local structure pairing tasks, ProtLOCA for the first time provides a valid solution for highlighting common local structures among proteins that have different overall structures but the same function. This suggests a new possibility for using deep learning methods to analyze protein structure and infer function.

Paper Structure (27 sections, 9 equations, 4 figures, 1 table)

Introduction
Global Structure Matching
Problem Formulation
Feature Representation
Model Architecture
Training Objective
Local Structure Alignment
Problem Formulation
ProtLOCA for Local Structure Alignment
Candidate Selection
Redundancy Removal
Unconditional Ranking
Experimental Analysis
CATH-aligns: Benchmark for Structure Alignment
Experimental Protocol
...and 12 more sections

Figures (4)

Figure 1: An illustrative pipeline of ProtLOCA for structure pairing (see the Global Structure Matching section). We employ ProtLOCA to extract vector representations of protein structures and calculate the cosine similarity between the learned hidden representations of protein pairs.
Figure 2: An illustrative pipeline of ProtLOCA for local structure alignment (see the Local Structure Alignment section). We employ ProtLOCA for residue-level point-to-point matching, which identifies similar local structures on proteins with different overall structures.
Figure 3: Model performance under different (left) perturbation probabilities $p$ for mask corruption; (middle) numbers of GVP layers; (right) pre-training targets.
Figure 4: Example of using ProtLOCA and TM-align to find the helix-turn-helix (HTH) motif in DNA-binding proteins. (A) HTH motif in the Tox repressor (PDB: 1F5T). The HTH motif is colored in red, DNA in yellow, and protein in white. (B) The HTH motif serves as the protein's DNA-binding site, shown here for the Tox repressor. The HTH motif is colored in pink, the protein in white, the DNA in yellow, and the hydrogen bonds between the HTH motif and DNA are marked in red. (C) Phage lambda cII protein (PDB: 1ZS4) HTH motif from ground truth (red), TM-align (blue), and ProtLOCA (green). (D) Transcriptional regulator PA2196 (PDB: 4L62) HTH motif from ground truth (red), TM-align (blue), and ProtLOCA (green).
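
The structure-pairing step described in the Figure 1 caption (cosine similarity between the learned representations of a protein pair) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `rank_by_similarity`, the toy three-dimensional vectors, and the domain names are all hypothetical, and the embeddings are assumed to have already been produced by some encoder and pooled to one fixed-length vector per protein.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query: Sequence[float], database: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every database entry against the query, most similar first."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up embeddings standing in for pooled protein representations.
toy_db = {
    "domain_a": [1.0, 0.0, 0.0],
    "domain_b": [0.0, 1.0, 0.0],
    "domain_c": [1.0, 1.0, 0.0],
}
ranking = rank_by_similarity([2.0, 0.0, 0.0], toy_db)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because cosine similarity ignores vector magnitude, the query `[2.0, 0.0, 0.0]` scores the same against `domain_a` as a unit-length query would; only the direction of the representation affects the ranking.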