A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers

Chenxi Yang; Yan Li; Martin Maas; Mustafa Uysal; Ubaid Ullah Hafeez; Arif Merchant; Richard McDougall

TL;DR

This paper targets the high storage-related total cost of ownership (TCO) of warehouse-scale computers and the practical obstacles to deploying monolithic ML models for data placement. It introduces a Bring-Your-Own-Model cross-layer design in which each workload trains a lightweight application-layer model that predicts workload importance; a co-designed storage-layer heuristic uses these predictions to drive data placement decisions. In a production prototype at Google and in large-scale simulations on production traces, the approach delivers improvements of up to $3.47\times$ in TCO savings over state-of-the-art baselines and an additional $3.22\%$ savings in production contexts, while keeping inference latency low and the models interpretable. The work demonstrates the viability, robustness, and generalizability of cross-layer ML for storage systems, and proposes this design philosophy as a route to practical ML deployment in complex infrastructure.

Abstract

Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on the overall system's efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data centers at Google, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world deployments with frequently changing workloads. To address this problem, we introduce a cross-layer approach where workloads instead "bring their own model". This strategy moves ML out of the storage system and instead allows each workload to train its own lightweight model at the application layer, capturing the workload's specific characteristics. These small, interpretable models generate predictions that guide a co-designed scheduling heuristic at the storage layer, enabling adaptation to diverse online environments. We build a proof-of-concept of this approach in a production distributed computation framework at Google. Evaluations in a test deployment and large-scale simulation studies using production traces show improvements of as much as $3.47\times$ in TCO savings compared to state-of-the-art baselines.
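
The cross-layer split described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Job` fields, the I/O-density thresholds in `toy_importance_model`, the three importance categories, and the fixed `min_category` admission rule in `StoragePlacer` are all hypothetical stand-ins chosen for illustration. The only point is the division of labor: the application layer produces a small categorical prediction, and the storage layer applies a simple heuristic to it.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job description; the fields are illustrative assumptions."""
    name: str
    size_gb: float   # storage footprint of the job's intermediate files
    io_ops: float    # expected number of I/O operations


def toy_importance_model(job: Job) -> int:
    """Application layer: stand-in for a workload's own lightweight model.

    Returns an importance category (0 = low, 2 = high). A real model would be
    trained per workload; these thresholds are invented for the example.
    """
    density = job.io_ops / max(job.size_gb, 1e-9)  # I/O operations per GB
    if density >= 100:
        return 2
    if density >= 10:
        return 1
    return 0


class StoragePlacer:
    """Storage layer: a simple heuristic that consumes predicted categories."""

    def __init__(self, ssd_capacity_gb: float, min_category: int) -> None:
        self.ssd_capacity_gb = ssd_capacity_gb
        self.min_category = min_category  # assumed fixed admission threshold
        self.ssd_used_gb = 0.0

    def place(self, job: Job, category: int) -> str:
        """Admit the job to SSD if it is important enough and it fits."""
        fits = self.ssd_used_gb + job.size_gb <= self.ssd_capacity_gb
        if category >= self.min_category and fits:
            self.ssd_used_gb += job.size_gb
            return "SSD"
        return "HDD"


jobs = [
    Job("hot_small", size_gb=10, io_ops=5000),   # density 500 -> category 2
    Job("cold", size_gb=10, io_ops=50),          # density 5   -> category 0
    Job("hot_large", size_gb=95, io_ops=19000),  # density 200 -> category 2
]
placer = StoragePlacer(ssd_capacity_gb=100, min_category=1)
placements = {j.name: placer.place(j, toy_importance_model(j)) for j in jobs}
```

In this sketch the storage layer never sees job features, only the category, so each workload could swap in a different model without any change to the placement code.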

Paper Structure (33 sections, 5 equations, 16 figures, 4 tables, 1 algorithm)

Introduction
Background
Storage for Data Processing Frameworks
SSD/HDD Tiering and its Trade-Offs
Production Requirements and Limitations
Google's Production Setup
Problem Formulation & Baselines
Oracle: Optimal Solution Based on Solver.
FirstFit: Static Placement.
Heuristic: Practical Adaptive Placement.
ML Baseline: Lifetime Prediction-Based.
Hybrid Learning Approach
Features
Model Design
Adaptive Category Selection Algorithm
...and 18 more sections

Figures (16)

Figure 1: Workloads show vastly different storage patterns.
Figure 2: Conceptual overview of the monolithic approach vs. the cross-layer approach.
Figure 3: Left: Data flow graph in a data processing framework. Data is processed in parallel and its jobs create intermediate files (blue) which are inputs for the next processing step. Right: Approach Overview. We analyze production workloads offline for model design and training. Online, each application's model predicts job properties and passes the prediction to the storage layer for job placement.
Figure 4: I/O density and TCO savings of each job (color shows the oracle placement decision when optimizing for TCO), tested under different SSD quotas.
Figure 5: Prototype results.
...and 11 more figures
