Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems
Ewnetu Bayuh Lakew, Petter Svärd, Erik Elmroth, Johan Tordsson
TL;DR
This work addresses performance variability in large-scale disaggregated NUMA systems by introducing a NUMA-aware, two-stage VM-to-core mapping that first minimizes remote memory access and then reduces inter-application interference using workload classifications and runtime counters. Implemented on a six-node NumaConnect-based platform with 288 cores and ~1TB RAM, the approach yields substantial performance gains and greatly reduced variability compared to vanilla Linux scheduling. The study combines hardware measurements (IPC and MPI) with real and synthetic workloads to demonstrate scalability benefits for memory-intensive applications in disaggregated cloud infrastructure. Overall, the proposed mapping framework offers practical improvements for resource utilization and paves the way for memory-aware scheduling enhancements in future disaggregated systems.
Abstract
Disaggregated systems have a novel architecture motivated by the requirements of resource-intensive applications such as social networking, search, and in-memory databases. The total amount of resources, such as memory and CPU cores, is very large in such systems. However, the distributed topology of disaggregated server systems results in non-uniform access latency and performance, with NUMA effects inside each box as well as additional access latency for remote resources. In this work, we study the effects of complex NUMA topologies on application performance and propose a method for improved, NUMA-aware mapping for virtualized environments running on disaggregated systems. Our mapping algorithm is based on pinning of virtual cores and/or migration of memory across a disaggregated system, and takes into account application performance, resource contention, and utilization. The proposed method is evaluated on a system with 288 cores and around 1 TB of memory, composed of six disaggregated commodity servers, through a combination of benchmarks and real applications such as memory-intensive graph databases. Our evaluation demonstrates significant improvement over the vanilla resource mapping methods. Overall, the mapping algorithm improves performance substantially compared to the default Linux scheduler used in the system.
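To make the idea of NUMA-aware mapping concrete, the following is a minimal Python sketch of a two-stage placement decision: first prefer the NUMA node closest to the VM's memory, then break ties by choosing the node with the fewest co-located workloads of the same class. It is an illustration under stated assumptions, not the algorithm evaluated in this work. The names `NumaNode`, `VM`, `access_cost`, `interference`, and `map_vm`, the three-level cost (0 for the same node, 1 for the same server, 2 for a remote server), and the workload class labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NumaNode:
    node_id: int
    server_id: int                      # which physical server hosts this node
    free_cores: List[int]               # core IDs not yet pinned to any VM
    resident_classes: List[str] = field(default_factory=list)


@dataclass
class VM:
    name: str
    vcpus: int
    memory_node: int                    # node holding most of the VM's memory
    workload_class: str                 # assumed label, e.g. "memory" or "cpu"


def access_cost(node: NumaNode, mem_node: NumaNode) -> int:
    """Assumed cost model: 0 = local node, 1 = same server, 2 = remote server."""
    if node.node_id == mem_node.node_id:
        return 0
    if node.server_id == mem_node.server_id:
        return 1
    return 2


def interference(node: NumaNode, vm: VM) -> int:
    """Count workloads already on the node that share the VM's class."""
    return sum(1 for c in node.resident_classes if c == vm.workload_class)


def map_vm(vm: VM, nodes: List[NumaNode]) -> Tuple[int, List[int]]:
    """Return (node_id, core_ids) for the VM and update the node's state."""
    mem_node = next(n for n in nodes if n.node_id == vm.memory_node)
    candidates = [n for n in nodes if len(n.free_cores) >= vm.vcpus]
    if not candidates:
        raise RuntimeError(f"no node has {vm.vcpus} free cores for {vm.name}")

    # Stage 1: keep only the nodes with the cheapest access to the VM's memory.
    best = min(access_cost(n, mem_node) for n in candidates)
    closest = [n for n in candidates if access_cost(n, mem_node) == best]

    # Stage 2: among those, pick the node with the least same-class contention.
    target = min(closest, key=lambda n: (interference(n, vm), n.node_id))

    cores = [target.free_cores.pop(0) for _ in range(vm.vcpus)]
    target.resident_classes.append(vm.workload_class)
    return target.node_id, cores
```

In a real deployment the returned core IDs would still have to be applied through a pinning mechanism, and the static class labels would be replaced or refined by runtime measurements; both are outside the scope of this sketch.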
