idMotif: An Interactive Motif Identification in Protein Sequences

Ji Hwan Park, Vikash Prasad, Sydney Newsom, Fares Najar, Rakhi Rajan

TL;DR

idMotif supports interactive motif discovery in protein sequences by combining a deep-learning-based sequence classifier, SHAP-based saliency explanations, and a visual analytics interface with five linked views. It generates embeddings with a pre-trained ProtBert model, fine-tunes the Transformer for sequence group classification, and applies local explanations so that motif candidates appear as salient regions within protein groups. Users explore these results through the Projection, Cluster, Sequence, Motif, and Distribution views. The approach is demonstrated on Cas1 CRISPR-Cas data; a case study and expert feedback indicate improved motif identification and outlier detection compared with static motif discovery methods such as MEME. The framework is designed to generalize to DNA/RNA by swapping in an appropriate pre-trained sequence model.

Abstract

This article introduces idMotif, a visual analytics framework designed to aid domain experts in the identification of motifs within protein sequences. Motifs, short sequences of amino acids, are critical for understanding the distinct functions of proteins. Identifying these motifs is pivotal for predicting diseases or infections. idMotif employs a deep learning-based method for the categorization of protein sequences, enabling the discovery of potential motif candidates within protein groups through local explanations of deep learning model decisions. It offers multiple interactive views for the analysis of protein clusters or groups and their sequences. A case study, complemented by expert feedback, illustrates idMotif's utility in facilitating the analysis and identification of protein sequences and motifs.
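To make the idea of "motif candidates from local explanations" concrete, the following is a minimal sketch of one way to turn per-residue saliency values into candidate regions. It is not the authors' method: the function name `salient_regions`, the fixed `threshold`, the `min_len` cutoff, and the example scores are all assumptions made for illustration. In idMotif the saliency values would come from SHAP applied to the fine-tuned classifier, and candidate regions are selected interactively rather than by a fixed rule.

```python
from typing import List, Sequence, Tuple


def salient_regions(
    saliency: Sequence[float], threshold: float, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index pairs for contiguous runs whose
    saliency is >= threshold and whose length is at least min_len.

    Hypothetical helper: threshold and min_len are illustrative choices.
    """
    regions: List[Tuple[int, int]] = []
    start = None  # start index of the run currently being tracked
    for i, value in enumerate(saliency):
        if value >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(saliency) - start >= min_len:
        regions.append((start, len(saliency)))
    return regions


# Made-up sequence and saliency scores, one score per residue.
sequence = "MKTAYIAKQRQISFVKSH"
scores = [0.01, 0.02, 0.40, 0.55, 0.61, 0.03, 0.02, 0.50, 0.04,
          0.01, 0.02, 0.45, 0.52, 0.48, 0.60, 0.02, 0.01, 0.00]

candidates = [(s, e, sequence[s:e])
              for s, e in salient_regions(scores, threshold=0.4)]
print(candidates)  # [(2, 5, 'TAY'), (11, 15, 'ISFV')]
```

The isolated high score at position 7 is dropped because a single residue is shorter than `min_len`. A fixed threshold like this is only a starting point; comparing salient regions across the sequences of a group is what the interactive views are meant to support.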

Paper Structure

This paper contains 31 sections, 2 equations, 10 figures.

Figures (10)

  • Figure 1: A pipeline of the proposed motif identification method. (a) Given protein sequences, (b) we use a pre-trained model (ProtBert; Elnaggar et al., 2020) to generate embeddings for the sequences. Next, we (c) fine-tune a Transformer model (ProtBert) for sequence group prediction and then (d) apply a local explanation method (SHAP) to the sequences.
  • Figure 2: An architecture of our protein sequence prediction model, including (a) a pre-trained ProtBert and (b) a fine-tuning process.
  • Figure 3: idMotif contains five linked views. (a) The Cluster view shows an overview of clustered protein sequences. (b) The Sequence view presents the details of each protein sequence. (c) The Projection view displays the similarity of protein sequences. (d) The Motif view displays the details of a selected region in the Cluster view for discovering motifs. Lastly, (e) the Distribution view visualizes the length distribution of protein sequences in a selected cluster.
  • Figure 4: Comparison of two UMAP projections: (a) the saliency values generated from SHAP analysis of the fine-tuned model, and (b) the embeddings taken directly from the pre-trained ProtBert. Different colors indicate different groups of protein sequences.
  • Figure 5: An example of the Cluster view. The Cluster view can visualize two types of information: (a) the type of amino acids and (b) the saliency values of amino acids.
  • ...and 5 more figures
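
As a small illustration of step (b) in Figure 1, the sketch below prepares a raw protein sequence in the form that, to the best of our knowledge, the public ProtBert model card recommends: uppercase residues separated by spaces, with the rare amino acids U, Z, O, and B mapped to X. The helper name `to_protbert_input` is hypothetical, and the source does not state whether idMotif applies exactly this preprocessing, so treat it as an assumption.

```python
import re


def to_protbert_input(sequence: str) -> str:
    """Format a protein sequence for a ProtBert-style tokenizer.

    Hypothetical helper: strips whitespace, uppercases, maps the rare
    residues U, Z, O, B to X, and separates residues with spaces.
    """
    cleaned = re.sub(r"\s+", "", sequence).upper()
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    return " ".join(cleaned)


example = to_protbert_input("mktuay zb")
print(example)  # M K T X A Y X X
```

The resulting string can then be passed to the tokenizer of a pre-trained protein language model; the model loading and embedding steps are omitted here because their exact configuration is not given in this summary.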