LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform

Lizhi Ma, Yi-Xiang Hu, Yuke Wang, Yifang Zhao, Yihui Ren, Jian-Xiang Liao, Feng Wu, Xiang-Yang Li

TL;DR

LeJOT tackles rising Databricks operational costs by combining a predictive execution-time model with a solver-based optimization framework for dynamic resource allocation. The approach predicts job runtimes using Ridge Regression on carefully engineered features and optimizes resource use via Branch-and-Bound on a formal cost-minimization formulation. Experimental results on real Lenovo Databricks workloads show average cloud cost reductions around 20% with robust scheduling performance. This work demonstrates a scalable, proactive pathway for cost-efficient orchestration in Data Lakehouse environments.

Abstract

With the rapid advancements in big data technologies, the Databricks platform has become a cornerstone for enterprises and research institutions, offering high computational efficiency and a robust ecosystem. However, managing the escalating operational costs associated with job execution remains a critical challenge. Existing solutions rely on static configurations or reactive adjustments, which fail to adapt to the dynamic nature of workloads. To address this, we introduce LeJOT, an intelligent job cost orchestration framework that leverages machine learning for execution time prediction and a solver-based optimization model for real-time resource allocation. Unlike conventional scheduling techniques, LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs while ensuring performance requirements are met. Experimental results on real-world Databricks workloads demonstrate that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe, outperforming traditional static allocation strategies. Our approach provides a scalable and adaptive solution for cost-efficient job scheduling in Data Lakehouse environments.
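
The prediction step can be illustrated with a minimal Ridge Regression sketch. This is not the authors' implementation: the two features (input size in GB, and input size per worker), the synthetic training rows, and the helper names `solve_linear`, `fit_ridge`, and `predict` are all assumptions chosen for illustration. It fits the closed-form ridge solution with an unpenalized intercept and predicts a runtime for one candidate cluster size.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(rows, targets, lam):
    """Minimize ||Xw - y||^2 + lam * ||w_features||^2; the bias is not penalized."""
    X = [list(r) + [1.0] for r in rows]  # append a bias column
    d = len(X[0])
    gram = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    for i in range(d - 1):  # penalize feature weights only
        gram[i][i] += lam
    rhs = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(d)]
    return solve_linear(gram, rhs)


def predict(weights, row):
    """Predicted runtime for one feature row."""
    return sum(w * v for w, v in zip(weights, list(row) + [1.0]))


# Synthetic history (hypothetical): (data_gb, data_gb / workers) -> minutes.
history = [(10, 5), (10, 2), (40, 10), (40, 5), (100, 10), (100, 25)]
minutes = [18, 12, 43, 33, 73, 103]

weights = fit_ridge(history, minutes, lam=0.1)
# Predicted runtime of a 40 GB job on 8 workers.
estimate = predict(weights, (40, 40 / 8))
```

Calling `predict` once per candidate cluster size yields the runtime table that an allocation step would then consume. The penalty `lam` trades a small bias for stability when features are correlated, which is a common reason to choose ridge over ordinary least squares.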

Paper Structure

This paper contains 18 sections, 17 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Overview of the LeJOT framework. The diagram is divided into three sequential stages: (a) Input: the expected earliest start time and latest end time of each job, together with the dependencies between jobs. (b) Runtime Prediction: ML algorithms predict each job’s runtime under different resource allocations. (c) Optimal Resource Allocation: a solver finds the lowest-cost resource allocation that satisfies the dependencies and time constraints.
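
Stage (c) of the figure can be sketched as a small hand-written branch-and-bound search. This is an illustration under stated assumptions, not the paper's formulation or solver: the job names, time windows, and per-option `(workers, runtime_minutes, cost)` tuples are invented, the function `optimize` is hypothetical, and the sketch assumes no shared capacity limit, so every job can start as soon as its window opens and its dependencies finish.

```python
from math import inf


def optimize(jobs, order):
    """Pick one option per job to minimize total cost.

    jobs:  name -> {"earliest", "latest", "deps", "options"}
           where options is a list of (workers, runtime, cost).
    order: job names in topological (dependency-respecting) order.
    Returns (best_cost, {name: workers}) or (inf, None) if infeasible.
    """
    # tail[i] = cheapest conceivable cost of jobs order[i:], ignoring
    # time constraints; used as an optimistic bound for pruning.
    tail = [0.0] * (len(order) + 1)
    for i in reversed(range(len(order))):
        tail[i] = tail[i + 1] + min(c for _, _, c in jobs[order[i]]["options"])

    best = {"cost": inf, "plan": None}

    def search(i, cost, finish, plan):
        if cost + tail[i] >= best["cost"]:
            return  # bound: this branch cannot beat the incumbent
        if i == len(order):
            best["cost"], best["plan"] = cost, dict(plan)
            return
        name = order[i]
        job = jobs[name]
        # Start as early as the window and the dependencies allow.
        start = max([job["earliest"]] + [finish[d] for d in job["deps"]])
        # Branch on the cheapest options first to find good incumbents early.
        for workers, runtime, c in sorted(job["options"], key=lambda o: o[2]):
            end = start + runtime
            if end > job["latest"]:
                continue  # violates the latest end time
            finish[name], plan[name] = end, workers
            search(i + 1, cost + c, finish, plan)
            del finish[name], plan[name]

    search(0, 0.0, {}, {})
    return best["cost"], best["plan"]


# Hypothetical three-job pipeline; runtimes would come from stage (b).
jobs = {
    "extract": {"earliest": 0, "latest": 60, "deps": [],
                "options": [(2, 50, 10), (4, 30, 12), (8, 20, 16)]},
    "transform": {"earliest": 0, "latest": 90, "deps": ["extract"],
                  "options": [(2, 60, 12), (4, 35, 14), (8, 20, 16)]},
    "report": {"earliest": 0, "latest": 100, "deps": ["transform"],
               "options": [(2, 20, 4), (4, 12, 5)]},
}
cost, plan = optimize(jobs, ["extract", "transform", "report"])
```

Because cost here depends only on the chosen option and not on the start time, scheduling each job as early as possible never hurts feasibility. Under a shared capacity limit that shortcut would no longer hold and start times would have to become decision variables.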