Hashing for Protein Structure Similarity Search

Jin Han, Wu-Jun Li

TL;DR

This paper proposes POSH, a method that learns a binary vector representation for each protein structure. Compared with methods based on real-valued vector representations, POSH dramatically reduces the time and memory cost of protein structure similarity search (PSSS).

Abstract

Protein structure similarity search (PSSS), which aims to retrieve proteins with similar structures, plays a crucial role across diverse domains, from drug design to protein function prediction and molecular evolution. Traditional alignment-based PSSS methods, which directly compute alignments on the protein structures, are highly time-consuming and have high memory cost. Recently, alignment-free methods, which represent protein structures as fixed-length real-valued vectors, have been proposed for PSSS. Although these methods have lower time and memory cost than alignment-based methods, their cost is still too high for large-scale PSSS, and their accuracy is unsatisfactory. In this paper, we propose a novel method, called $\underline{\text{p}}$r$\underline{\text{o}}$tein $\underline{\text{s}}$tructure $\underline{\text{h}}$ashing (POSH), for PSSS. POSH learns a binary vector representation for each protein structure, which can dramatically reduce the time and memory cost of PSSS compared with methods based on real-valued vector representations. Furthermore, POSH introduces expressive hand-crafted features and a structure encoder to model both node and edge interactions in proteins. Experimental results on real datasets show that POSH outperforms other methods and achieves state-of-the-art accuracy. In addition, POSH achieves a memory saving of more than six times and a speed improvement of more than four times compared with other methods.
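To illustrate why binary codes can cut search time and memory, here is a minimal sketch of Hamming-distance search over bit-packed codes. It is not the POSH implementation: the sign-threshold binarization stands in for the learned hash function, the random vectors stand in for protein embeddings, and the names `to_binary_codes` and `hamming_search` are hypothetical. The sizes it reports describe this toy setup only, not the savings measured in the paper.

```python
import numpy as np

def to_binary_codes(embeddings):
    """Binarize real-valued vectors by sign and pack 8 bits into each byte.

    Assumption: a simple sign threshold replaces the learned hashing model.
    """
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

# Lookup table: number of set bits for every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_search(query_code, db_codes, top_k=5):
    """Return indices and Hamming distances of the top_k closest codes."""
    xor = np.bitwise_xor(db_codes, query_code)   # differing bits, byte by byte
    dists = POPCOUNT[xor].sum(axis=1)            # count differing bits per row
    order = np.argsort(dists, kind="stable")[:top_k]
    return order, dists[order]

# Toy database: 1000 random 64-dimensional vectors (placeholder data).
rng = np.random.default_rng(0)
db_real = rng.standard_normal((1000, 64)).astype(np.float32)
db_codes = to_binary_codes(db_real)

# Use a database entry as the query; its distance to itself is zero.
query_code = db_codes[42]
idx, dist = hamming_search(query_code, db_codes)

print("real-valued bytes:", db_real.nbytes)
print("binary-code bytes:", db_codes.nbytes)
print("nearest indices:", idx, "distances:", dist)
```

In this sketch each 64-dimensional vector shrinks from 256 bytes of `float32` to 8 bytes of packed bits, and the distance computation reduces to XOR plus a bit count instead of floating-point arithmetic.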

Paper Structure

This paper contains 33 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The architecture of POSH
  • Figure 2: Illustration of bond angles and dihedral angles. The letter R denotes the side chain of the amino acid.
  • Figure 3: POSH for protein structure similarity search. (a) In the training phase, the distance between binary hash codes is minimized or maximized depending on the original similarity of the samples. (b) In the testing phase, POSH produces the binary hash code for an incoming query protein structure, which is then used to search the database for similar protein structures.
  • Figure 4: Time cost of searching in databases of different sizes
  • Figure 5: Comprehensive performance comparison