idMotif: An Interactive Motif Identification in Protein Sequences
Ji Hwan Park, Vikash Prasad, Sydney Newsom, Fares Najar, Rakhi Rajan
TL;DR
idMotif supports interactive motif discovery in protein sequences by combining a deep-learning-based sequence classifier with SHAP-based saliency explanations and a five-view visual analytics interface. It builds on ProtBert embeddings, fine-tunes a Transformer for group classification, and uses local explanations to surface motif candidates as salient regions across protein groups. The interface presents these results in Projection, Cluster, Sequence, Motif, and Distribution views. The approach is demonstrated on Cas1 CRISPR-Cas data; case-study insights and expert feedback indicate improved motif identification and outlier detection compared to static motif discovery methods such as MEME. The framework is designed to generalize to DNA/RNA by swapping in an appropriate pre-trained sequence model, which would extend motif discovery to other genetic data domains.
Abstract
This article introduces idMotif, a visual analytics framework designed to aid domain experts in the identification of motifs within protein sequences. Motifs, short sequences of amino acids, are critical for understanding the distinct functions of proteins. Identifying these motifs is pivotal for predicting diseases or infections. idMotif employs a deep learning-based method for the categorization of protein sequences, enabling the discovery of potential motif candidates within protein groups through local explanations of deep learning model decisions. It offers multiple interactive views for the analysis of protein clusters or groups and their sequences. A case study, complemented by expert feedback, illustrates idMotif's utility in facilitating the analysis and identification of protein sequences and motifs.
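To make the idea of "motif candidates from local explanations" concrete, the following is a minimal sketch and not idMotif's actual procedure. It assumes that per-residue attribution scores (for example, from a SHAP explainer applied to the classifier) are already available as a list of floats aligned with the sequence. The function name `salient_windows`, the fixed window width, and the greedy non-overlap selection are all illustrative assumptions.

```python
from typing import List, Tuple


def salient_windows(
    sequence: str,
    scores: List[float],
    width: int = 5,
    top_k: int = 3,
) -> List[Tuple[int, str, float]]:
    """Rank fixed-width windows by mean absolute attribution.

    Hypothetical helper: returns up to top_k non-overlapping windows as
    (start index, residues, mean |score|), highest-scoring first.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    if width <= 0 or width > len(sequence):
        return []

    abs_scores = [abs(s) for s in scores]

    # Score every window by the mean absolute attribution of its residues.
    candidates = []
    for start in range(len(sequence) - width + 1):
        mean = sum(abs_scores[start:start + width]) / width
        candidates.append((mean, start))

    # Highest mean first; ties broken by the earlier start position.
    candidates.sort(key=lambda c: (-c[0], c[1]))

    # Greedily keep windows that do not overlap an already chosen one.
    chosen: List[Tuple[int, str, float]] = []
    for mean, start in candidates:
        if all(abs(start - kept_start) >= width for kept_start, _, _ in chosen):
            chosen.append((start, sequence[start:start + width], mean))
        if len(chosen) == top_k:
            break
    return chosen
```

For a toy sequence `"MKTAYIAKQR"` with scores `[0, 0, 0.9, 0.8, 0.7, 0, 0, 0.1, 0.1, 0.1]`, `width=3`, and `top_k=2`, the top window is `"TAY"` at index 2. Comparing such windows across the sequences of one protein group would be one way to look for shared candidate regions, which the interactive views are meant to support visually.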
