PanDelos-plus: A parallel algorithm for computing sequence homology in pangenomic analysis

Simone Colli, Emiliano Maresi, Vincenzo Bonnici

TL;DR

PanDelos-plus, a fully parallel, gene-centric redesign of PanDelos, enables population-scale comparative genomics to be performed on standard multicore workstations, making large-scale bacterial pangenome analysis accessible for routine use in everyday research.

Abstract

The identification of homologous gene families across multiple genomes is a central task in bacterial pangenomics, traditionally requiring computationally demanding all-against-all comparisons. PanDelos addresses this challenge with an alignment-free and parameter-free approach based on k-mer profiles, combining high speed, ease of use, and accuracy competitive with state-of-the-art methods. However, the increasing availability of genomic data requires tools that can scale efficiently to larger datasets. To address this need, we present PanDelos-plus, a fully parallel, gene-centric redesign of PanDelos. The algorithm parallelizes the most computationally intensive phases (Best Hit detection and Bidirectional Best Hit extraction) through data decomposition and a thread pool strategy, while employing lightweight data structures to reduce memory usage. Benchmarks on synthetic datasets show that PanDelos-plus achieves up to 14x faster execution and reduces memory usage by up to 96%, while maintaining accuracy. These improvements enable population-scale comparative genomics to be performed on standard multicore workstations, making large-scale bacterial pangenome analysis accessible for routine use in everyday research.

Paper Structure

This paper contains 14 sections, 13 equations, 4 figures, 3 tables, 3 algorithms.

Figures (4)

  • Figure 1: PanDelos-plus pipeline (revisiting PanDelos) in three stages: (a) automatic selection of the optimal k-mer length; (b) all-against-all gene comparison to generate Bidirectional Best Hits (BBH), in three subphases: serial construction of multiplicity vectors (k-mer extraction and grouping), parallel Best Hit detection via generalized Jaccard similarity (stored in a jagged array), and parallel BBH extraction; (c) final clustering into gene families based on the BBH.
  • Figure 2: Strong scaling of PanDelos-plus on real datasets. Top: Peak memory usage (MB) remains nearly constant as the number of threads increases, indicating minimal overhead from parallelization. Bottom: Execution time (s) decreases almost inversely with the number of threads, confirming efficient parallel scalability across all datasets.
  • Figure 3: Scaling behavior of PanDelos-plus on a synthetic dataset of 50 genomes. Top: Memory usage (GB) remains almost constant as the number of threads increases, while PanDelos shows much higher consumption ($27.2$ GB vs $2.8$ GB for PanDelos-plus). Bottom: Execution time (s) decreases almost linearly with thread count, from approximately $34942$ s (9.7 h) with 1 thread to about $1963$ s (0.5 h) with 32 threads, a speedup of about 18x over single-threaded execution.
  • Figure 4: Scalability of PanDelos-plus with increasing dataset size ($50$ to $600$ genomes) at $32$ threads. Top: Peak memory usage grows approximately linearly with the number of genomes, remaining below $40$ GB even for $600$ genomes. Bottom: Execution time scales sublinearly with dataset size, increasing from $\approx 0.5$ h ($50$ genomes) to $\approx 62$ h ($600$ genomes), indicating near-linear time complexity and effective parallel efficiency even on very large datasets.
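
The comparison stage named in the Figure 1 caption (multiplicity vectors, generalized Jaccard similarity, Best Hits, Bidirectional Best Hits) can be illustrated with a minimal serial sketch. This is not the PanDelos-plus implementation: the function names are hypothetical, k is fixed by hand rather than selected automatically, and any score thresholds, the jagged-array storage, and the thread pool described in the paper are left out. It assumes the usual definition of generalized Jaccard similarity on multisets, the sum of per-k-mer minima divided by the sum of per-k-mer maxima.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def kmer_multiplicities(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer of seq (its multiplicity vector)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def generalized_jaccard(a: Counter, b: Counter) -> float:
    """Sum of per-k-mer minima divided by sum of per-k-mer maxima."""
    keys = a.keys() | b.keys()
    shared = sum(min(a[key], b[key]) for key in keys)
    total = sum(max(a[key], b[key]) for key in keys)
    return shared / total if total else 0.0


def best_hits(queries: Dict[str, Counter],
              targets: Dict[str, Counter]) -> Dict[str, Set[str]]:
    """For each query gene, keep the target genes with the top nonzero score."""
    hits: Dict[str, Set[str]] = {}
    for q_name, q_vec in queries.items():
        scores = {t_name: generalized_jaccard(q_vec, t_vec)
                  for t_name, t_vec in targets.items()}
        top = max(scores.values(), default=0.0)
        hits[q_name] = {t for t, s in scores.items() if s == top and s > 0.0}
    return hits


def bidirectional_best_hits(genome_a: Dict[str, str],
                            genome_b: Dict[str, str],
                            k: int) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where b is a best hit of a and a is a best hit of b."""
    vec_a = {name: kmer_multiplicities(seq, k) for name, seq in genome_a.items()}
    vec_b = {name: kmer_multiplicities(seq, k) for name, seq in genome_b.items()}
    a_to_b = best_hits(vec_a, vec_b)
    b_to_a = best_hits(vec_b, vec_a)
    return {(a, b) for a, hits in a_to_b.items() for b in hits if a in b_to_a[b]}


# Toy genomes (made-up sequences, for illustration only).
genome_a = {"a1": "MKTAYIAKQR", "a2": "GGGSSGGGSS"}
genome_b = {"b1": "MKTAYIAKQK", "b2": "GGGSSGGGTT"}
print(sorted(bidirectional_best_hits(genome_a, genome_b, k=3)))
# prints [('a1', 'b1'), ('a2', 'b2')]
```

Each query gene in `best_hits` is scored independently of the others, which is the property that allows this phase to be split across worker threads; the sketch keeps it serial for brevity.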