A Scalable Tool For Analyzing Genomic Variants Of Humans Using Knowledge Graphs and Machine Learning

Shivika Prasanna, Ajay Kumar, Deepthi Rao, Eduardo Simoes, Praveen Rao

TL;DR

This paper tackles the challenge of integrating heterogeneous genomic variant data with rich relational context for scalable analysis. It proposes a pipeline that converts VCF data and CADD scores into a knowledge graph, enriched with SnpEff annotations and patient metadata, stored in BlazeGraph, and made accessible for graph ML via Deep Graph Library. VariantKG enables three workflows—graph enrichment, graph creation, and graph ML inference—allowing users to incorporate new data and perform node classification with GraphSAGE and GCNs on a COVID-19 RNA-seq variant dataset. The approach demonstrates scalability to billions of triples and provides a practical framework for efficient querying and ML on genomic graphs, with potential applicability beyond COVID-19 to broader genomic research.

Abstract

The integration of knowledge graphs and graph machine learning (GML) in genomic data analysis offers several opportunities for understanding complex genetic relationships, especially at the RNA level. We present a comprehensive approach for leveraging these technologies to analyze genomic variants, specifically in the context of RNA sequencing (RNA-seq) data from COVID-19 patient samples. The proposed method involves extracting variant-level genetic information, annotating the data with additional metadata using SnpEff, and converting the enriched Variant Call Format (VCF) files into Resource Description Framework (RDF) triples. The resulting knowledge graph is further enhanced with patient metadata and stored in a graph database, facilitating efficient querying and indexing. We utilize the Deep Graph Library (DGL) to perform graph machine learning tasks, including node classification with GraphSAGE and Graph Convolutional Networks (GCNs). Our approach demonstrates significant utility using our proposed tool, VariantKG, in three key scenarios: enriching graphs with new VCF data, creating subgraphs based on user-defined features, and conducting graph machine learning for node classification.
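The VCF-to-RDF conversion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `http://example.org/variantkg/` namespace, the predicate names (`hasChromosome`, `hasPosition`, and so on), and the `vcf_line_to_triples` helper are all hypothetical, chosen only to show how one VCF record, including `INFO` entries such as the `ANN` field that SnpEff adds, might map to N-Triples statements.

```python
from urllib.parse import quote

# Hypothetical namespace; the paper's actual ontology IRIs are not shown here.
BASE = "http://example.org/variantkg/"


def _literal(value):
    """Format a plain N-Triples string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def vcf_line_to_triples(line, sample_id):
    """Convert one tab-separated VCF data line into a list of N-Triples strings.

    Only the eight fixed VCF columns are read; genotype columns are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _vid, ref, alt, _qual, _flt, info = fields[:8]

    # One subject IRI per variant, keyed on chromosome, position, and alleles.
    key = quote(f"{chrom}_{pos}_{ref}_{alt}", safe="")
    subject = f"<{BASE}variant/{key}>"
    sample = f"<{BASE}sample/{quote(sample_id, safe='')}>"

    triples = [
        (subject, f"<{BASE}hasChromosome>", _literal(chrom)),
        (subject, f"<{BASE}hasPosition>", _literal(pos)),
        (subject, f"<{BASE}hasRef>", _literal(ref)),
        (subject, f"<{BASE}hasAlt>", _literal(alt)),
        (subject, f"<{BASE}fromSample>", sample),
    ]

    # Each key=value INFO entry becomes its own predicate; bare flags are skipped.
    for item in info.split(";"):
        if "=" in item:
            name, value = item.split("=", 1)
            predicate = f"<{BASE}info/{quote(name, safe='')}>"
            triples.append((subject, predicate, _literal(value)))

    return [f"{s} {p} {o} ." for s, p, o in triples]


if __name__ == "__main__":
    record = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=10;ANN=G|missense_variant"
    for triple in vcf_line_to_triples(record, "S1"):
        print(triple)
```

Statements in this form could then be bulk-loaded into an RDF store; a real pipeline would more likely use an RDF library and the project's own ontology rather than hand-built strings.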

Paper Structure (22 sections, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Workflow from raw data collection through processing and collation into a dataset.
  • Figure 2: Additional annotations by the SnpEff tool.
  • Figure 3: Ontology of the knowledge graph.
  • Figure 4: Ontology for CADD Scores.
  • Figure 5: Architecture of GCN.
  • ...and 9 more figures