Artificial Intelligence for Cost-Aware Resource Prediction in Big Data Pipelines
Harshit Goyal
TL;DR
The paper tackles cost-efficient resource provisioning in cloud-based big data pipelines by forecasting resource utilization with a Random Forest regression model trained on Google Borg traces. A comprehensive preprocessing pipeline converts semi-structured Borg logs into a compact numeric feature set, enabling accurate predictions with $R^2 = 0.991$, $\mathrm{MAE} = 0.0048$, and $\mathrm{RMSE} = 0.137$. The results show strong accuracy for common workloads and reveal increased variance for rare large jobs, highlighting data-balancing needs and future multi-resource extensions. The work also details a practical deployment pathway for integrating model-driven predictions into Kubernetes/YARN schedulers to enable proactive autoscaling and tangible cost savings while preserving SLA reliability.
Abstract
Efficient resource allocation is a key challenge in modern cloud computing. Over-provisioning leads to unnecessary costs, while under-provisioning risks performance degradation and SLA violations. This work presents an artificial intelligence approach to predicting resource utilization in big data pipelines using Random Forest regression. We preprocess the Google Borg cluster traces to clean and transform the data and to extract relevant features (CPU, memory, usage distributions). The model achieves high predictive accuracy ($R^2 = 0.99$, MAE = 0.0048, RMSE = 0.137), capturing non-linear relationships between workload characteristics and resource utilization. Error analysis shows strong performance on small-to-medium jobs and higher error variance on rare large-scale jobs. These results demonstrate the potential of AI-driven prediction for cost-aware autoscaling in cloud environments, reducing unnecessary provisioning while safeguarding service quality.
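The three reported metrics weight errors differently, which is one plausible reading of an RMSE well above the MAE: a few large misses on rare jobs dominate the squared error while barely moving the absolute error. The following is a minimal sketch of the standard definitions, not the paper's evaluation code. The function name `regression_metrics` and the toy values are hypothetical and are not taken from the Borg traces.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: every miss counts in proportion to its size.
    mae = sum(abs(e) for e in errors) / n

    # Root mean squared error: squaring lets large misses dominate.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # Coefficient of determination: 1 - residual variance / total variance.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return r2, mae, rmse


# Hypothetical toy data (an assumption for illustration only):
# 99 small jobs predicted almost exactly, plus 1 large job missed by 1.0.
y_true = [0.01] * 99 + [5.0]
y_pred = [0.011] * 99 + [4.0]

r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R^2={r2:.4f}  MAE={mae:.5f}  RMSE={rmse:.5f}")
```

In this toy case the single large miss makes the RMSE roughly nine times the MAE while $R^2$ stays high, which mirrors the qualitative pattern described in the abstract: accurate predictions for common workloads alongside larger errors on rare large jobs.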
