Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?
Yang Tan, Lirong Zheng, Bozitao Zhong, Liang Hong, Bingxin Zhou
TL;DR
The paper questions the blanket benefit of incorporating amino acid sequence information into protein representations for structure-related tasks. It introduces ProtLOCA, a roto-equivariant GVP-based framework that encodes local geometric structure without relying on amino acid types, and validates it on global and local structure alignment tasks. On independent CATH-based benchmarks, ProtLOCA achieves state-of-the-art global structure matching and demonstrates the ability to identify common local structural motifs across proteins with different overall folds, including cases where sequence-based methods falter. The work suggests a shift toward structure-centric representations for function inference and highlights the importance of focusing on local geometries when structure alignment is the primary objective.
Abstract
Deep learning has become a crucial tool in studying proteins. While the significance of modeling protein structure has been discussed extensively in the literature, amino acid types are typically included in the input by default for many inference tasks. This study demonstrates, using a structure alignment task, that embedding amino acid types does not always help a deep learning model learn a better representation. To this end, we propose ProtLOCA, a local geometry alignment method based solely on structural representations of amino acids. The effectiveness of ProtLOCA is examined on a global structure-matching task over protein pairs, using an independent test dataset based on CATH labels. Our method outperforms existing sequence- and structure-based representation learning methods, matching structurally consistent protein domains more quickly and accurately. Furthermore, in local structure pairing tasks, ProtLOCA for the first time provides a valid solution for highlighting common local structures among proteins with different overall structures but the same function. This suggests a new possibility for using deep learning methods to analyze protein structure and infer function.
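To make the global structure-matching evaluation concrete, the following is a minimal sketch of how matching over learned representations could be scored against CATH labels. It is not ProtLOCA's actual procedure: the abstract does not specify the similarity measure or the metric, so cosine similarity and a top-1 label check are assumptions made here for illustration. The function names, the three-dimensional embedding vectors, the domain names, and the label assignments are all hypothetical toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Return candidate names sorted by descending similarity to the query.

    `candidates` maps a domain name to its embedding vector.
    Ties are broken alphabetically by name so the ranking is deterministic.
    """
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in candidates.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


def top1_label_match(query_label, ranking, labels):
    """True if the best-ranked candidate shares the query's structural label."""
    return labels[ranking[0]] == query_label


# Toy data (assumption): in practice the vectors would come from a trained
# encoder, and the labels from the CATH annotation of each domain.
candidates = {
    "domain_a": [0.9, 0.1, 0.0],
    "domain_b": [0.0, 1.0, 0.2],
    "domain_c": [0.1, 0.0, 1.0],
}
labels = {"domain_a": "3.40.50", "domain_b": "1.10.10", "domain_c": "2.60.40"}

query_vec = [1.0, 0.2, 0.1]
query_label = "3.40.50"

ranking = rank_candidates(query_vec, candidates)
hit = top1_label_match(query_label, ranking, labels)
print(ranking, hit)
```

Under this setup, an encoder that ignores amino acid types and one that includes them can be compared on equal footing: both produce vectors, and only the fraction of queries whose top-ranked candidate shares the query's label differs.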
