Serverless Approach to Running Resource-Intensive STAR Aligner
Piotr Kica, Michał Orzechowski, Maciej Malawski
TL;DR
The study tackles the challenge of running the memory-intensive STAR RNA-seq aligner in serverless environments. It evaluates serverless options and implements a STAR-based pipeline on AWS ECS with Fargate that loads a genome index from shared storage and processes large FASTQ datasets. Experiments comparing ECS-Fargate with EC2 show that serverless deployment is feasible but slower and more expensive for large-scale data, and that memory limits caused some runs to fail. Cost-saving strategies and divide-and-conquer approaches offer potential improvements. The work demonstrates that serverless STAR is viable for small to medium batch workloads, highlights the practical trade-offs between serverless and VM-based deployments, and points to targeted optimizations for broader scalability.
Abstract
Applying serverless computing to the alignment of RNA sequences can improve many existing bioinformatics workflows by reducing operational costs and execution times. This work analyzes the applicability of serverless services for running the STAR aligner, which is known for its accuracy and its large memory requirement. That memory requirement presents a challenge, as serverless services were designed for light, short tasks. Nevertheless, we successfully deploy a STAR-based pipeline on the AWS ECS service, propose multiple optimizations, and perform an experiment with 17 TB of data. The results are compared against a standard virtual machine (VM) based solution, showing that serverless is a valid alternative for small-scale batch processing. At large scale, however, where efficiency matters most, VMs are still recommended.
