vPALs: Towards Verified Performance-aware Learning System For Resource Management

Guoliang He, Gingfung Yeung, Sheriffo Ceesay, Adam Barker

TL;DR

vPALs addresses runtime performance prediction in cluster resource management by leveraging Pressure Stall Information (PSI) and other system metrics, and by enforcing formal verification of per-application DNN predictors. The approach uses monotonicity specifications and Ouroboros-based verification to ensure predictions respect domain knowledge, yielding safe, bounded behavior at deployment. Empirical results show that verified DNNs maintain or slightly improve accuracy compared with unverified models, and achieve perfect accuracy on counterexamples, while exposing limitations in verifier scalability and generalization under distribution shifts. The work highlights the practical viability of integrating verifiable learning into resource management, and outlines future directions toward a unified learner and uncertainty-aware prediction to scale safety in AI-enabled clusters.
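
To make the idea of a monotonicity specification concrete, the sketch below spot-checks one input point of a predictor. It is only an illustration, not the paper's verifier: formal verification reasons over whole input regions, whereas this tests a single perturbation. The names `violates_monotonicity` and `toy_predictor`, the feature layout, and the assumed direction (higher pressure should not raise predicted performance) are all hypothetical.

```python
from typing import Callable, Sequence


def violates_monotonicity(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    feature: int,
    delta: float,
    increasing: bool = False,
) -> bool:
    """Return True if raising one feature by `delta` moves the prediction the wrong way.

    With increasing=False the (assumed) specification is: the prediction must
    not go up when the chosen feature goes up.
    """
    bumped = list(x)
    bumped[feature] += delta
    before, after = predict(x), predict(bumped)
    return after < before if increasing else after > before


def toy_predictor(x: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model: x = [cpu_pressure, cpu_cores]."""
    cpu_pressure, cpu_cores = x
    return 100.0 - 40.0 * cpu_pressure + 5.0 * cpu_cores


if __name__ == "__main__":
    # Raising the pressure feature lowers the toy prediction, so no violation.
    print(violates_monotonicity(toy_predictor, [0.2, 4.0], feature=0, delta=0.1))
```

A point where this check returns True plays the role of a counterexample: an input on which the model contradicts the stated domain knowledge.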

Abstract

Accurately predicting task performance at runtime in a cluster helps a resource management system decide whether a task should be migrated because of performance degradation caused by interference. This benefits both cluster operators and service owners. However, deploying learning-based performance prediction systems requires sophisticated safeguard mechanisms, because models such as Deep Neural Networks (DNNs) are inherently stochastic and black-box. Vanilla Neural Networks (NNs) can be vulnerable to out-of-distribution data samples, which can lead to sub-optimal decisions. As a step towards a safe learning system for performance prediction, we propose vPALs, which leverages well-correlated system metrics and verification to produce safe performance predictions at runtime, providing an extra layer of safety when integrating learning techniques into cluster resource management systems. Our experiments show that vPALs can outperform vanilla NNs across our benchmark workload.

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 6 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Neural Network Training and Verification System Overview. Performance data and verification bounds are required to produce a satisfactory DNN.
  • Figure 2: MindSpore and Nginx demonstrate a strong correlation between application performance and the system metrics. Solr demonstrates a medium correlation, while Redis and MySQL show a weak correlation between application performance and system metrics. The red line marks a Pearson coefficient of $0.5$ and the blue line marks $0$.
  • Figure 3: vPALs data pipeline.
  • Figure 4: Training Overview. The input feature vector consists of a concatenation of system metrics and workload configurations, such as CPU cores and memory requested.
  • Figure 5: Training loss versus epochs. The yellow interval is when the verifier activates and performs verification.
  • ...and 1 more figure
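
The Figure 2 caption compares Pearson coefficients against a reference value of 0.5. As a minimal sketch of that kind of comparison, the code below computes a Pearson coefficient between one system metric and an application performance series. The function names and the use of the absolute value in `exceeds_reference` are assumptions made for illustration; the caption does not say how the paper's analysis was implemented.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def exceeds_reference(r: float, threshold: float = 0.5) -> bool:
    """Assumed rule: compare |r| with the 0.5 reference line from the caption."""
    return abs(r) >= threshold


if __name__ == "__main__":
    # Made-up sample values, not data from the paper.
    metric = [0.1, 0.3, 0.5, 0.7]
    latency = [10.0, 14.0, 19.0, 25.0]
    r = pearson(metric, latency)
    print(round(r, 3), exceeds_reference(r))
```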