The SAP Cloud Infrastructure Dataset: A Reality Check of Scheduling and Placement of VMs in Cloud Computing

Arno Uhlig, Iris Braun, Matthias Wählisch

TL;DR

This paper analyzes SAP's production cloud infrastructure to diagnose VM scheduling and placement inefficiencies in a real-world, memory-heavy enterprise environment. Using telemetry from 1,800 hypervisors and 48,000 VMs across 30 days, the authors reveal substantial CPU contention and overprovisioning, particularly for CPU resources, while memory is often underutilized, highlighting imbalances and fragmentation. The work demonstrates suboptimal performance of vanilla OpenStack Nova scheduling in handling diverse SAP workloads and provides a publicly available, high-resolution dataset to enable data-driven evaluation and development of improved, holistic scheduling strategies. The study offers concrete guidance for combining placement with dynamic rescheduling, adopting memory-centric bin-packing where appropriate, and extending platform schedulers to incorporate historical utilization signals, thereby reducing fragmentation and improving resource efficiency in large-scale cloud environments.

Abstract

Allocating resources in a distributed environment is a fundamental challenge. In this paper, we analyze the scheduling and placement of virtual machines (VMs) in the cloud platform of SAP, the world's largest enterprise resource planning software vendor. Based on data from roughly 1,800 hypervisors and 48,000 VMs within a 30-day observation period, we highlight potential improvements for workload management. The data was measured through observability tooling that tracks resource usage and performance metrics across the entire infrastructure. In contrast to existing datasets, ours uniquely offers fine-grained time-series telemetry data of fully virtualized enterprise-level workloads from both long-running and memory-intensive SAP S/4HANA and diverse, general-purpose applications. Our key findings include several suboptimal scheduling situations, such as CPU resource contention exceeding 40%, CPU ready times of up to 220 seconds, significantly imbalanced compute hosts with a maximum CPU utilization on intra-building block hosts of up to 99%, and overprovisioned CPU and memory resources resulting in over 80% of VMs using less than 70% of the provided resources. Bolstered by these findings, we derive requirements for the design and implementation of novel placement and scheduling algorithms and provide guidance to optimize resource allocations. We make the full dataset used in this study publicly available to enable data-driven evaluations of scheduling approaches for large-scale cloud infrastructures in future research.
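
To make the overprovisioning figure concrete, the following is a minimal sketch of how a statistic such as "share of VMs using less than 70% of the provided resources" could be computed from per-VM telemetry. The record type `VmUsage`, its field names, and the sample values are hypothetical and chosen for illustration only; they do not reflect the schema or contents of the published dataset, and the paper may define utilization differently (for example, by average rather than peak usage).

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    """Hypothetical per-VM record; not the dataset's actual schema."""
    vm_id: str
    allocated_vcpus: float   # resources provided to the VM
    peak_used_vcpus: float   # highest usage seen in the observation window


def share_below_threshold(vms, threshold=0.70):
    """Return the fraction of VMs whose peak usage stays below
    `threshold` times their allocation."""
    if not vms:
        return 0.0
    below = sum(
        1 for vm in vms
        if vm.peak_used_vcpus < threshold * vm.allocated_vcpus
    )
    return below / len(vms)


# Made-up sample values, only to exercise the function.
sample = [
    VmUsage("vm-a", 8, 2.0),
    VmUsage("vm-b", 16, 15.0),
    VmUsage("vm-c", 4, 1.0),
    VmUsage("vm-d", 32, 8.0),
    VmUsage("vm-e", 8, 5.0),
]

print(share_below_threshold(sample))
```

Using peak rather than mean usage is the more conservative choice here: a VM counted as overprovisioned under this definition never approached its allocation during the window.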

Paper Structure

This paper contains 44 sections, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Hierarchical abstractions in cloud computing infrastructure
  • Figure 2: Simplified architecture of scheduling-relevant components in OpenStack Nova and VMware
  • Figure 3: Scheduling of resources can be influenced by filtering and weighting, which increases complexity. Host numbering is reversed in the third process to illustrate how weighting can alter host priorities.
  • Figure 4: Regional deployments of the SAP Cloud Infrastructure across SAP-owned and shared data centers.
  • Figure 5: Daily average percentage of free CPU resources per host within a single data center
  • ...and 10 more figures