Flora: Efficient Cloud Resource Selection for Big Data Processing via Job Classification
Jonathan Will, Lauritz Thamsen, Jonathan Bader, Odej Kao
TL;DR
Flora addresses the challenge of cost-efficient cloud resource selection for big data processing by classifying jobs as memory-demanding or memory-yielding and ranking cluster configurations using current prices and historical profiling data. It follows a three-step workflow: infrastructure profiling, job classification, and cost-aware ranking of configurations. After the initial profiling phase, this yields near-optimal cost with minimal per-job overhead. In an evaluation on a dataset of Spark job executions on Google Cloud, Flora's selections deviate from the cost-optimal choice by less than 6% on average and by less than 24% at most, while maintaining reasonable runtimes. The approach is robust to pricing dynamics and misclassification scenarios, and cloud providers could deploy it to offer cost-effective processing for diverse workloads.
Abstract
Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters of cloud resources. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual resource allocations, such as memory and CPU cores, must meet the specific resource demands of the job. Meanwhile, the available cloud configurations are often plentiful, especially in public clouds, and the current cost of the available resource options can fluctuate. Addressing this challenge, we present Flora, a low-overhead approach to cost-optimizing cloud cluster configurations for big data processing. Flora lets users categorize jobs according to their data access patterns and derives suitable cluster resource configurations from executions of test jobs of the same category, considering current resource costs. In our evaluation on a new dataset comprising 180 Spark job executions on Google Cloud, Flora's cluster resource selections exhibit an average deviation below 6% from the cost-optimal solution, with a maximum deviation below 24%.
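The idea of deriving a configuration from test jobs of the same category under current prices can be illustrated with a minimal sketch. It is not Flora's actual implementation: the record layout, the configuration names, the runtimes, the prices, and the cost model (mean profiled runtime multiplied by the current hourly price) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical profiling records: (job_category, config_name, runtime_hours).
# All values are invented for illustration, not taken from the paper's dataset.
PROFILING_RUNS = [
    ("memory-demanding", "small-mem", 2.0),
    ("memory-demanding", "large-mem", 1.0),
    ("memory-yielding", "small-mem", 1.0),
    ("memory-yielding", "large-mem", 0.9),
]


def rank_configs(category, runs, hourly_price):
    """Order configurations from cheapest to most expensive for one job category.

    Assumed cost model: mean runtime of same-category test jobs on a
    configuration, multiplied by that configuration's current hourly price.
    """
    runtimes = defaultdict(list)
    for run_category, config, hours in runs:
        if run_category == category:
            runtimes[config].append(hours)
    # Only configurations with a known current price can be ranked.
    costs = {
        config: mean(hours) * hourly_price[config]
        for config, hours in runtimes.items()
        if config in hourly_price
    }
    return sorted(costs, key=costs.get)


# Hypothetical current prices per cluster-hour.
prices_now = {"small-mem": 1.0, "large-mem": 1.5}
# Hypothetical later prices after the large-memory option becomes more expensive.
prices_later = {"small-mem": 1.0, "large-mem": 2.5}

best_now = rank_configs("memory-demanding", PROFILING_RUNS, prices_now)[0]
best_later = rank_configs("memory-demanding", PROFILING_RUNS, prices_later)[0]
```

Under these assumed numbers, the preferred configuration for a memory-demanding job changes once the price of the large-memory option rises, without any new profiling runs. Only the ranking step is repeated per job, which is why the per-job overhead stays low.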
