Fine-Grained Modeling and Optimization for Intelligent Resource Management in Big Data Processing

Chenghao Lyu; Qi Fan; Fei Song; Arnab Sinha; Yanlei Diao; Wei Chen; Li Ma; Yihui Feng; Yaliang Li; Kai Zeng; Jingren Zhou

TL;DR

This work tackles resource optimization for production-scale big data processing on MaxCompute by decomposing the problem into three decisions (the partition count, the placement of parallel instances on machines, and the resource allocation to each instance), all optimized under multiple objectives. It introduces fine-grained instance-level modeling based on multi-channel inputs (MCI) and uses a Graph Transformer Network-based plan embedder to predict per-instance latencies. A two-step Stage-level Optimizer combines an Intelligent Placement Advisor (IPA) for latency-aware placement with a Resource Assignment Advisor (RAA) for instance-specific resource tuning within a hierarchical MOO framework. Evaluations on production MaxCompute workloads show simultaneous reductions of 37-72% in latency and 43-78% in cost relative to the current optimizer and scheduler, with optimization times of 0.02-0.23s.

Abstract

Big data processing at production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting the performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, the placement of parallel instances on machines, and the resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems, all while meeting stringent time constraints for scheduling. This paper presents a MaxCompute-based integrated system that supports multi-objective resource optimization via fine-grained instance-level modeling and optimization. We propose a new architecture that breaks RO into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level recommendations in a hierarchical MOO framework. Evaluation using production workloads shows that our new RO system can reduce latency by 37-72% and cost by 43-78% at the same time, compared to the current optimizer and scheduler, while running in 0.02-0.23s.
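To make the multi-objective framing concrete, the following is a minimal, generic sketch of Pareto filtering over candidate configurations scored on latency and cost. It is not the paper's hierarchical MOO method, IPA, or RAA; the function name `pareto_front`, the candidate labels, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (label, predicted latency in seconds, cost); all values are made up.
Candidate = Tuple[str, float, float]


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates.

    A candidate is dominated if another one is no worse on both
    objectives (latency, cost) and strictly better on at least one.
    """
    front = []
    for name, lat, cost in cands:
        dominated = any(
            (l2 <= lat and c2 <= cost) and (l2 < lat or c2 < cost)
            for _, l2, c2 in cands
        )
        if not dominated:
            front.append((name, lat, cost))
    # Sort by latency so the latency/cost trade-off is easy to read.
    return sorted(front, key=lambda c: c[1])


# Hypothetical resource configurations for one stage.
candidates = [
    ("A", 10.0, 5.0),
    ("B", 8.0, 7.0),
    ("C", 12.0, 6.0),  # dominated by A (slower and more expensive)
    ("D", 8.0, 9.0),   # dominated by B (same latency, higher cost)
    ("E", 6.0, 12.0),
]
front = pareto_front(candidates)
print([name for name, _, _ in front])
```

Each surviving candidate represents a different latency/cost trade-off; an optimizer would still need a preference (for example, a weight on each objective) to pick one recommendation from this set.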

Paper Structure

This paper contains 50 sections, 8 theorems, 16 equations, 33 figures, 13 tables, 4 algorithms.

Introduction
Related Work
System Overview
Background on MaxCompute
System Design for Resource Optimization
Fine-Grained Modeling
Multi-Channel Coverage
MCI-based Models
Stage-level Optimization
MOO Problem and Our Approach
Intelligent Placement Advisor (IPA)
Resource Assignment Advisor (RAA)
Experiments
Model Evaluation
Resource Optimization (RO) Evaluation
...and 35 more sections

Key Result

Theorem 5.1

IPA achieves the single-objective stage-latency optimality under the column-order assumption.

Figures (33)

Figure 1: The lifecycle of a query job in MaxCompute
Figure 2: Trace overview
Figure 3: Extended system architecture for resource optimization
Figure 4: The multi-channel coverage from basic features
Figure 5: MCI-based modeling framework
...and 28 more figures

Theorems & Definitions (11)

Definition 5.1
Definition 5.2
Theorem 5.1
Definition 5.3
Proposition 5.1
Proposition 5.2
Proposition D.1
Theorem D.1
Lemma 1
Proposition E.1
...and 1 more
