MirLibSpark: A Scalable NGS Plant MicroRNA Prediction Pipeline for Multi-Library Functional Annotation

Chao-Jung Wu, Amine M. Remita, Abdoulaye Baniré Diallo

TL;DR

MirLibSpark addresses the scalability gap in plant miRNA prediction by delivering a Spark-based, end-to-end pipeline that combines miRNA prediction with multi-library functional annotation. It introduces a modular architecture (M1–M7) with a core miRNA predictor (M2) and leverages distributed data processing (RDDs) to handle large genomes and numerous libraries. Benchmarking demonstrates faster execution and competitive accuracy compared to existing tools, including substantial speedups on real and simulated datasets and strong scalability with increasing core counts. The work advances practical, automated plant miRNA analysis suitable for large-scale datasets and multi-library studies, with open-source availability and deployment options.

Abstract

The emergence of next-generation sequencing (NGS) has drastically increased the volume of transcriptomic data. Although many standalone algorithms and workflows for novel microRNA (miRNA) prediction have been proposed, few are designed to process large volumes of sequence data from large genomes, and even fewer go on to annotate functional miRNAs by analyzing multiple libraries. We propose an improved pipeline for high-volume data facilities by implementing mirLibSpark on the Apache Spark framework. This pipeline is the fastest current method and provides an accuracy improvement over the standard. In this paper, we deliver the first distributed functional miRNA predictor as a standalone and fully automated package. It is an efficient and accurate miRNA predictor with functional insight. Furthermore, it complies with the gold-standard requirements for plant miRNA predictions.

Paper Structure

This paper contains 18 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: MirLibSpark overview. M1: genomic reference annotations. M2: miRNA prediction. M3: target gene prediction. M4: differential expression analysis. M5: precursor visualization. M6: KEGG pathway annotation. M7: functional enrichment analysis. M for module.
  • Figure 2: MirLibSpark miRNA prediction module.
  • Figure 3: Comparative performance of miRNA annotation software for plants.
  • Figure 4: Venn diagram characterizing the predictions for GSE44622 by Jeong et al. (2013) and mirLibSpark, after a clustering step using blastn to group similar sequences together.