A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator

Roozbeh Bostandoost, Pooria Namyar, Siva Kesava Reddy Kakarla, Ryan Beckett, Santiago Segarra, Eli Cortez, Ankur Mallick, Kevin Hsieh, Rodrigo Fonseca, Mohammad Hajiesmaili, Behnaz Arzani

TL;DR

This work presents SANJESH, a holistic analysis tool for ML-augmented systems that enables end-to-end evaluation of multiple interacting ML models within a production VM allocator. It frames the problem as a bi-level optimization with probabilistic constraints and uses a time-partitioning scheme plus a CEGAR-based mapping to identify adversarial inputs and the underlying VM features that cause degradation. Key findings include that SANJESH can reveal scenarios up to $4\times$ worse than trace-based simulations and that the CPU prediction often drives the primary end-to-end risk. The results demonstrate SANJESH's ability to answer practical operator questions, produce actionable risk surfaces, and generalize to other ML-enabled systems beyond the VM allocator.

Abstract

Many operational cloud systems use one or more machine learning models that help them achieve better efficiency and performance. But operators do not have tools to help them understand how each model, and the interaction between the models, affects end-to-end system performance. SANJESH is such a tool. SANJESH supports a diverse set of performance-related queries, which we answer through a bi-level optimization. We invent novel mechanisms to solve this optimization more quickly. These techniques allow us to solve an optimization which prior work failed to solve even after $24$ hours. As a proof of concept, we apply SANJESH to an example production system that uses multiple ML models to optimize virtual machine (VM) placement. These models impact how many servers the operator uses to host VMs and how often it has to live-migrate VMs because the servers run out of resources. SANJESH finds scenarios where these models cause $\sim 4\times$ worse performance than what simulation-based approaches detect.

Paper Structure

This paper contains 26 sections, 15 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: The CPU model has the most impact on the number of live-migrations. We engineer each of the ML models to have 70% accuracy and only use predictions from one of the models (we use the ground truth for the rest) in each experiment. The y-axis shows how the models change the risk of live-migrations relative to the memory model.
  • Figure 2: Sanjesh overview. Users provide the ML models, their feature dependencies, and example test data, along with the specific queries they want Sanjesh to answer (\ref{table:rc_usecases}). The analyzer and constraint generator convert these inputs into probabilistic constraints and formulate the problem. The bi-level solver computes adversarial predictions of the models and ground truth labels that lead the VM allocator to underperform. A CEGAR-based approach then derives VM feature sequences that produce these predictions. Sanjesh outputs the resulting VM sequences and predictions to the user.
  • Figure 3: Examples of how we model the mechanisms in \ref{table:mechanisms}. We place $N=10$ VMs and use $x_i$ to indicate whether the CPU model's predictions match the ground truth ($x'_j$ for the hypothetical model); $y_i$ captures whether the correct prediction is $0$. We use $\mathcal{R}_6$ to describe a random sampling of size $6$ from the set $\{1, 2, \dots, 10\}$.
  • Figure 4: When the feature $f_2$ is large, the risk surface (shaded region) covers a smaller portion of the feature space, and is potentially less likely, than when $f_2$ is small (left). We look at the projection of $(f_1, f_3)$ on the $f_2$ plane (right).
  • Figure 5: An example of how we use the CEGAR-based approach. We consider a simple multi-class LGBM classifier where the model has one tree per class and takes three features ($f_1, f_2, f_3$) as input. To check whether there exists an input feature vector that causes the model to predict Class 2, we apply a CEGAR strategy as follows: we cut the trees to depth 1 and approximate the parts we pruned with the minimum leaf value in the trees for Class 1 and Class 3 and the maximum leaf value for Class 2. If the SMT query is infeasible, Class 2 is unreachable; otherwise, we expand the trees and repeat the process.
  • ...and 9 more figures
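
The prune-and-refine loop described in the Figure 5 caption can be illustrated with a small, self-contained sketch. This is not the paper's implementation. Several things here are assumptions made for illustration: trees are encoded as nested tuples, the SMT query is replaced by a brute-force enumeration of leaf combinations with interval checks, and an early check of each abstract witness against the full trees is added. The toy trees and the names `prune`, `paths`, and `find_input_for_class` are hypothetical.

```python
import math
from itertools import product

# Assumed encoding: a tree is a float (leaf score) or a tuple
# (feature_index, threshold, left, right); x[feature] <= threshold goes left.

def depth(tree):
    return 0 if not isinstance(tree, tuple) else 1 + max(depth(tree[2]), depth(tree[3]))

def leaves(tree):
    if not isinstance(tree, tuple):
        yield tree
    else:
        yield from leaves(tree[2])
        yield from leaves(tree[3])

def prune(tree, d, optimistic):
    """Cut the tree at depth d; a pruned subtree becomes its max leaf
    (optimistic, used for the target class) or its min leaf (other classes)."""
    if not isinstance(tree, tuple):
        return tree
    if d == 0:
        return max(leaves(tree)) if optimistic else min(leaves(tree))
    f, t, left, right = tree
    return (f, t, prune(left, d - 1, optimistic), prune(right, d - 1, optimistic))

def paths(tree, box=None):
    """Yield (box, leaf) pairs; box maps feature -> half-open interval (lo, hi]."""
    box = {} if box is None else box
    if not isinstance(tree, tuple):
        yield box, tree
        return
    f, t, left, right = tree
    lo, hi = box.get(f, (-math.inf, math.inf))
    if lo < min(hi, t):
        yield from paths(left, {**box, f: (lo, min(hi, t))})
    if max(lo, t) < hi:
        yield from paths(right, {**box, f: (max(lo, t), hi)})

def merge(boxes):
    """Intersect boxes; return None if some feature interval becomes empty."""
    out = {}
    for box in boxes:
        for f, (lo, hi) in box.items():
            plo, phi = out.get(f, (-math.inf, math.inf))
            lo, hi = max(lo, plo), min(hi, phi)
            if lo >= hi:
                return None
            out[f] = (lo, hi)
    return out

def point_in(box, n_features):
    """Pick one concrete feature vector inside the box."""
    x = []
    for f in range(n_features):
        lo, hi = box.get(f, (-math.inf, math.inf))
        x.append(hi if hi < math.inf else (lo + 1.0 if lo > -math.inf else 0.0))
    return x

def evaluate(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def wins(scores, target):
    return all(scores[target] > s for k, s in enumerate(scores) if k != target)

def find_input_for_class(trees, target, n_features):
    """Return (input, depth) if the target class is reachable, else (None, depth)."""
    full = max(depth(t) for t in trees)
    for d in range(1, max(full, 1) + 1):
        abstract = [prune(t, d, optimistic=(k == target)) for k, t in enumerate(trees)]
        feasible = False
        for combo in product(*(list(paths(t)) for t in abstract)):
            box = merge([b for b, _ in combo])
            if box is None or not wins([v for _, v in combo], target):
                continue
            feasible = True
            x = point_in(box, n_features)
            if wins([evaluate(t, x) for t in trees], target):
                return x, d  # abstract witness is also a real witness
        if not feasible:
            return None, d  # over-approximation is infeasible, so class is unreachable
    raise AssertionError("unreachable: the full-depth abstraction is exact")

# Toy model: one tree per class over three features (indices 0, 1, 2).
trees = [
    (0, 0.5, 1.0, (1, 0.3, 0.2, 2.0)),   # Class 1
    (0, 0.5, (2, 0.7, 0.1, 0.4), 1.5),   # Class 2
    (1, 0.3, 0.9, 0.3),                  # Class 3
]
witness, used_depth = find_input_for_class(trees, target=1, n_features=3)
```

Because the pruned trees over-approximate the target class and under-approximate the others, an infeasible abstract query is enough to conclude that the class is unreachable, while a feasible one only justifies refining further. In a setting closer to the one in the caption, the enumeration would be replaced by an SMT query over the pruned trees.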