Exploiting Application-to-Architecture Dependencies for Designing Scalable OS

Yao Xiao, Nikos Kanakaris, Anzhe Cheng, Chenzhong Yin, Nesreen K. Ahmed, Shahin Nazarian, Andrei Irimia, Paul Bogdan

TL;DR

The paper addresses the limited scalability and application-obliviousness of operating systems on multi-core platforms by introducing NetworkedOS, a four-layer network abstraction that links dynamic application instructions, kernel interactions, memory frames, and hardware cores. It combines a compile-time optimization, an overlapping-cluster partitioning that minimizes a quality function $T$ balancing sequential work, parallel work, and inter-process communication (IPC), with a run-time greedy mapper that assigns processes to cores based on memory affinity and inter-process interactions. Concretely, the approach constructs the four-layer network from instruction traces, uses $T$ to guide partitioning, and runs an $O(P)$ scheduling strategy at run time to reduce IPC and messaging. Empirical evaluation on multi-core hardware shows improvements over MINIX3, Linux, and Barrelfish in IPC efficiency and in application performance on the NAS and PARSEC benchmarks. The key contributions are the formal multi-layer network model, the overlapping-cluster partitioning, and the memory-affinity-driven runtime mapping, with reported gains of up to 7.11x over Linux on a 128-core system and 2.01x over Barrelfish on a 64-core system.
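To make the trade-off behind overlapping clusters concrete, here is a minimal sketch. It is not the paper's definition of $T$: the functions `count_messages` and `quality`, the unit instruction costs, and the `ipc_weight` parameter are all hypothetical stand-ins chosen for illustration. The stand-in score is the work of the largest cluster plus a weighted count of cross-cluster dependencies, and an instruction that appears in several clusters is assumed to be re-executed in each of them.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (producer instruction, consumer instruction)


def count_messages(edges: List[Edge], clusters: List[Set[str]]) -> int:
    """Count dependencies whose consumer runs in a cluster lacking the producer."""
    messages = 0
    for src, dst in edges:
        for cluster in clusters:
            # A message is needed when this cluster executes the consumer
            # but does not hold its own copy of the producer.
            if dst in cluster and src not in cluster:
                messages += 1
    return messages


def quality(
    edges: List[Edge],
    clusters: List[Set[str]],
    cost: Dict[str, float],
    ipc_weight: float = 1.0,
) -> float:
    """Hypothetical stand-in for T: largest cluster work plus weighted IPC."""
    work = [sum(cost[node] for node in cluster) for cluster in clusters]
    return max(work) + ipc_weight * count_messages(edges, clusters)


# Toy dependency graph: instruction "a" feeds two independent chains.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")]
cost = {name: 1.0 for name in "abcde"}

disjoint = [{"a", "b", "d"}, {"c", "e"}]          # "a" lives in one cluster only
overlapping = [{"a", "b", "d"}, {"a", "c", "e"}]  # "a" is replicated in both

t_disjoint = quality(edges, disjoint, cost)
t_overlapping = quality(edges, overlapping, cost)
```

In this toy case, replicating `a` removes the single cross-cluster message at the price of one extra unit of work in the second cluster, so the stand-in score drops from 4.0 to 3.0. Whether replication pays off in general depends on the relative cost of recomputation and messaging, which is the balance the paper's $T$ is described as capturing.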

Abstract

With the advent of hundreds of cores on a chip to accelerate applications, the operating system (OS) needs to exploit the parallelism provided by the underlying hardware resources to determine the right number of processes to map onto a multi-core system. However, existing OSes are not scalable and are oblivious to applications. We address these issues with NetworkedOS, a multi-layer network representation of the dynamic application-to-OS-to-architecture dependencies. We adopt a compile-time analysis and construct a network representing the dependencies between the dynamic instructions translated from the applications, the kernel, and the services. We propose an overlapping partitioning scheme to detect the clusters, or processes, that can potentially run in parallel and be mapped onto cores while reducing the number of messages transferred. At run time, processes are mapped onto the multi-core system, taking process affinity into consideration. Our experimental results indicate that NetworkedOS achieves performance improvements as high as 7.11x compared to Linux running on a 128-core system and 2.01x compared to Barrelfish running on a 64-core system.
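The run-time mapping step can be pictured with a small greedy placement routine. This is a sketch under stated assumptions, not the authors' scheduler: the function `greedy_map`, the affinity score (shared page frames plus message counts), the per-core `capacity` limit, and the tie-breaking rule are all hypothetical. Each process is visited once, so for a fixed number of cores and a fixed capacity the work grows linearly with the number of processes.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple


def greedy_map(
    frames: Dict[str, Set[int]],        # process -> page frames it touches
    ipc: Dict[FrozenSet[str], int],     # {p, q} -> messages exchanged by p and q
    num_cores: int,
    capacity: int,                      # assumed limit on processes per core
) -> Dict[str, int]:
    """Place each process on the core whose residents it has most affinity with."""
    placement: Dict[str, int] = {}
    residents: List[List[str]] = [[] for _ in range(num_cores)]

    for proc in frames:  # single pass over the processes
        best_core: Optional[int] = None
        best_key: Optional[Tuple[int, int]] = None
        for core in range(num_cores):
            if len(residents[core]) >= capacity:
                continue  # core is full
            # Affinity = shared page frames + messages with processes on this core.
            score = sum(
                len(frames[proc] & frames[other])
                + ipc.get(frozenset((proc, other)), 0)
                for other in residents[core]
            )
            # Higher affinity wins; ties go to the less loaded core.
            key = (score, -len(residents[core]))
            if best_key is None or key > best_key:
                best_core, best_key = core, key
        if best_core is None:
            raise ValueError("not enough core capacity for all processes")
        placement[proc] = best_core
        residents[best_core].append(proc)
    return placement


# Toy workload: p0/p1 share frames and talk a lot; so do p2/p3.
frames = {"p0": {1, 2, 3}, "p1": {2, 3, 4}, "p2": {10, 11}, "p3": {11, 12}}
ipc = {frozenset(("p0", "p1")): 5, frozenset(("p2", "p3")): 2}

placement = greedy_map(frames, ipc, num_cores=2, capacity=2)
# placement == {"p0": 0, "p1": 0, "p2": 1, "p3": 1}
```

Co-locating processes that share frames and exchange messages is one simple way to express process affinity; a real scheduler would also have to weigh load balance and contention, which this sketch reduces to a single capacity limit.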

Paper Structure (7 sections, 4 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of the NetworkedOS framework. First, at compile time, we build the multi-layer network to analyze the correlations among applications, the kernel, and different services (e.g., device drivers and file systems). We then partition the application layer into an optimal number of processes to be executed on the multi-core platform. Next, based on the interactions among processes and the number of page frames used by each process, we propose a scheduling algorithm to map processes onto cores at run time.
  • Figure 2: Overview of the multi-layer network construction. We convert high-level languages into the corresponding dynamic low-level instructions. Using code tracing, analysis and profiling, we keep track of instructions in each basic block, analyze dependencies, and profile instructions to form an interconnected multi-layer network.
  • Figure 3: (Top) Application speedup comparison. (Bottom left) Execution time on a 2-core machine. (Bottom right) Execution time with a 1KB message.
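The multi-layer network described in the captions for Figures 1 and 2 can be sketched as a layered directed graph. The following is only an illustration: the class `MultiLayerNetwork`, the function `build_network`, the trace record fields (`id`, `deps`, `syscall`, `frame`), and the `frame_home` map from page frames to cores are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterator, List, Set, Tuple

Node = Tuple[str, str]  # (layer, identifier)
LAYERS = ("application", "kernel", "memory", "core")


class MultiLayerNetwork:
    """Directed graph whose nodes are tagged with one of four layers."""

    def __init__(self) -> None:
        self.adj: Dict[Node, Set[Node]] = {}

    def add_node(self, node: Node) -> None:
        if node[0] not in LAYERS:
            raise ValueError(f"unknown layer: {node[0]}")
        self.adj.setdefault(node, set())

    def add_edge(self, src: Node, dst: Node) -> None:
        self.add_node(src)
        self.add_node(dst)
        self.adj[src].add(dst)

    def edges(self) -> Iterator[Tuple[Node, Node]]:
        for src, targets in self.adj.items():
            for dst in targets:
                yield src, dst

    def count_edges(self, inter_layer: bool) -> int:
        """Count edges that cross layers (True) or stay within one (False)."""
        return sum((s[0] != d[0]) == inter_layer for s, d in self.edges())


def build_network(trace: List[dict], frame_home: Dict[int, int]) -> MultiLayerNetwork:
    """Turn a (hypothetical) instruction trace into a four-layer network."""
    net = MultiLayerNetwork()
    for rec in trace:
        instr: Node = ("application", rec["id"])
        net.add_node(instr)
        for dep in rec.get("deps", []):          # data dependencies
            net.add_edge(("application", dep), instr)
        if rec.get("syscall"):                   # kernel interaction
            net.add_edge(instr, ("kernel", rec["syscall"]))
        if rec.get("frame") is not None:         # memory access
            frame: Node = ("memory", str(rec["frame"]))
            net.add_edge(instr, frame)
            if rec["frame"] in frame_home:       # assumed frame-to-core link
                net.add_edge(frame, ("core", str(frame_home[rec["frame"]])))
    return net


trace = [
    {"id": "i0", "deps": [], "frame": 7},
    {"id": "i1", "deps": ["i0"], "syscall": "write"},
    {"id": "i2", "deps": ["i0"], "frame": 7},
]
net = build_network(trace, frame_home={7: 0})
```

For this three-instruction trace, the network has six nodes, two intra-layer dependency edges, and four inter-layer edges. A partitioner or mapper could then read affinity information, such as which instructions share frame 7, directly from the graph.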