Enhancing non-Perl bioinformatic applications with Perl: Building novel, component based applications using Object Orientation, PDL, Alien, FFI, Inline and OpenMP

Christos Argyropoulos

TL;DR

The paper investigates how Perl can serve as the glue for building high-level, component-based bioinformatics applications for data- and compute-intensive workloads, especially in the era of long-read sequencing. It presents two case studies, Polyester and edlib, to show how Perl's object-oriented and meta-programming capabilities, together with PDL, Alien, FFI, Inline, and OpenMP, can compose and accelerate heterogeneous components. The author implements Perl wrappers around an RNA-seq simulator and a fast sequence aligner, introduces new PDL-based random number generation and IO modules, and develops data-flow abstractions that enable scalable parallelism via MCE and OpenMP. The results indicate that Perl can achieve competitive performance when combined with vectorization (PDL) and cross-language integration, supporting reusable, parallel pipelines for modern sequencing analyses and motivating broader use of Perl in Bio::SeqAlignment-style frameworks.

Abstract

Component-Based Software Engineering (CBSE) is a methodology that assembles pre-existing, re-usable software components into new applications, which is particularly relevant for fast-moving, data-intensive fields such as bioinformatics. While Perl was used extensively in this field until a decade ago, more recent applications opt for Bioconductor/R or Python. This trend represents a significant missed opportunity for the rapid generation of novel bioinformatic applications out of pre-existing components, since Perl offers a variety of abstractions that can facilitate composition. In this paper, we illustrate the utility of Perl for CBSE through a combination of Object Oriented frameworks, the Perl Data Language (PDL), and facilities for interfacing with non-Perl code through Foreign Function Interfaces and inlining of foreign source code. To do so, we enhance Polyester, an RNA sequencing simulator written in R, and edlib, a fast sequence similarity search library based on the edit distance. The first case study illustrates the near effortless authoring of new, highly performant Perl modules for the simulation of random numbers using the GNU Scientific Library and PDL, and proposes Perl and Perl/C alternatives to the Python tool cutadapt, which is used to "trim" polyA tails from biological sequences. For the edlib case, we leverage the power of metaclass programming to endow edlib with coarse, process-based parallelism through the Many Core Engine (MCE) module and fine-grained parallelism through OpenMP, a C/C++/Fortran Application Programming Interface for shared-memory multithreaded processing. These use cases provide proof-of-concept for the Bio::SeqAlignment framework, which can organize heterogeneous components in complex memory and command-line based workflows for the construction of novel bioinformatic tools to analyze data from long-read sequencing platforms, e.g. Nanopore.

Paper Structure (23 sections, 6 figures, 4 tables)

This paper contains 23 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: PALS-NS experimental workflow (A), text-based model of a well-formed read (B) and custom bioinformatics pipeline (C) to remove decorators. Dashed boxes indicate modifications to biochemical protocols, read models and bioinformatics pipeline for the SMART protocol. Abbreviations: SSP, Strand Switching Primer; PCR, Polymerase Chain Reaction; RT, Reverse Transcription; VNP, poly-thymidine based primer. Figure reproduced from mackenzie_make_2022, available under the CC-BY 4.0 International license.
  • Figure 2: Performance evaluation of Random Number Generation in R (upper panel) and Perl (lower panel). The figure shows violin plots of 1,000 replicates of a task that involves simulating one million random numbers from the log-normal distribution with mean parameter $\log(125)$ and scale parameter $1$, truncated to the interval $[0,250]$. The performance of the Xoshiro RNG (DQRNG in R, PDLUNIF in Perl) was evaluated vis-a-vis the built-in RNGs (UNIF in R, PERLRNG in Perl) and the uniform RNG from GSL in Perl. Other scenarios evaluated included: a) the impact of object construction (WO_OC used a single object for all 1,000 replicates vs. WITH_OC, which created a new object for each replicate), b) using non-vectorized base Perl for RNG (PERLRNG) instead of PDL (PDLUNIF), c) using non-vectorized base Perl for the inverse CDF method (MATHGSL), and d) alternatives in R to provide dependencies such as the RNG and the (inverse) CDF functions of the target distribution (apndx:RNGCodeInR).
  • Figure 3: Data flows for sequence mapping
  • Figure 4: Benchmarking a pure MCE and a pure OpenMP implementation for an equal number of workers (processes under MCE or threads under OpenMP). The figure shows experimental data (OpenMP: triangles, MCE: filled circles), regression curve fits and 95% confidence intervals (gray bands).
  • Figure 5: Contour plot of the performance of Edlib_MCE_OpenMP over combinations of different numbers of threads and processes. The points highlighted with '+' are the top 5% performers (smallest execution time).
  • ...and 1 more figures
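
The benchmark task in the Figure 2 caption (sampling from a log-normal distribution with mean parameter log(125) and scale parameter 1, truncated to [0, 250], using the inverse CDF method) can be sketched as follows. This is a minimal illustration in standard-library Python, not the paper's Perl/PDL or R code; the function name `truncated_lognormal` and the clamping constant `eps` are assumptions made here for the example.

```python
import math
import random
from statistics import NormalDist


def truncated_lognormal(n, mu, sigma, lower, upper, rng, eps=1e-12):
    """Draw n samples from LogNormal(mu, sigma) restricted to [lower, upper].

    Hypothetical helper for illustration: a uniform draw is mapped into the
    CDF range [F(lower), F(upper)] and then pushed through the inverse CDF.
    """
    z = NormalDist()  # standard normal, provides cdf() and inv_cdf()

    def lognormal_cdf(x):
        # The log-normal CDF is the normal CDF of the standardized log value.
        return 0.0 if x <= 0 else z.cdf((math.log(x) - mu) / sigma)

    p_lo, p_hi = lognormal_cdf(lower), lognormal_cdf(upper)
    samples = []
    for _ in range(n):
        # Uniform draw restricted to the probability mass inside the interval.
        u = p_lo + (p_hi - p_lo) * rng.random()
        # inv_cdf() requires 0 < p < 1, so clamp away from the endpoints.
        u = min(max(u, eps), 1.0 - eps)
        samples.append(math.exp(mu + sigma * z.inv_cdf(u)))
    return samples


# Parameters taken from the Figure 2 caption; the seed is arbitrary.
draws = truncated_lognormal(1000, math.log(125), 1.0, 0.0, 250.0, random.Random(42))
```

The loop processes one draw at a time, which corresponds to the non-vectorized scenarios in the caption; a vectorized variant would apply the same three steps (uniform draw, rescale, inverse CDF) to whole arrays at once.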