VLCs: Managing Parallelism with Virtualized Libraries

Yineng Yan, William Ruys, Hochan Lee, Ian Henriksen, Arthur Peters, Sean Stephens, Bozhi You, Henrique Fingler, Martin Burtscher, Milos Gligoric, Keshav Pingali, Mattan Erez, George Biros, Christopher J. Rossbach

TL;DR

VLCs introduce library-level virtualization to partition resources among parallel libraries within a single process, without changing library code or modifying the OS. By isolating libraries in separate linker namespaces and interposing on their resource queries through a user-space VLC Monitor, VLCs provide fine-grained CPU/GPU partitioning, allow otherwise thread-unsafe calls to run in parallel, and support nested parallelism. The authors implement C++ (VLC++) and Python (PyVLC) prototypes, demonstrate low overhead, and report speedups of up to 2.85x on synthetic workloads and up to 1.96x for ARPACK, plus a 1.46x gain over MPI in Kokkos multi-GPU scenarios. The approach offers a practical, general way to compose parallel libraries with less contention and better resource utilization on modern heterogeneous systems.

Abstract

As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition in mind and assume they have exclusive access to all resources. Using such libraries concurrently can result in contention and degraded performance. Prior solutions involve modifying the libraries or the OS, which is often infeasible. We propose Virtual Library Contexts (VLCs), which are process subunits that encapsulate sets of libraries and associated resource allocations. VLCs control the resource utilization of these libraries without modifying library code. This enables the user to partition resources between libraries to prevent contention, or load multiple copies of the same library to allow parallel execution of otherwise thread-unsafe code within the same process. In this paper, we describe and evaluate C++ and Python prototypes of VLCs. Experiments show VLCs enable a speedup up to 2.85x on benchmarks including applications using OpenMP, OpenBLAS, and LibTorch.
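The partitioning idea described above can be illustrated with a toy model. The sketch below is not the VLC++ or PyVLC API: `ToyMonitor`, `partition`, `library_pick_threads`, and `run_in_context` are hypothetical names, the 8-core machine is assumed, and thread-local state stands in for the linker-namespace isolation the paper uses. It only shows why answering a library's resource query with a per-context allotment lets two unmodified libraries share a process without oversubscribing cores.

```python
import os
import threading


def partition(cpus, sizes):
    """Split a list of CPU ids into disjoint, consecutive chunks."""
    if sum(sizes) > len(cpus):
        raise ValueError("requested more CPUs than are available")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(frozenset(cpus[start:start + size]))
        start += size
    return chunks


class ToyMonitor:
    """Hypothetical stand-in for a resource monitor (not the paper's API)."""

    def __init__(self):
        self._local = threading.local()

    def enter(self, cpus):
        # Bind the calling thread to a context that owns `cpus`.
        self._local.cpus = frozenset(cpus)

    def cpu_count(self):
        # Interposed query: report only the caller's allotment.
        cpus = getattr(self._local, "cpus", None)
        return len(cpus) if cpus is not None else os.cpu_count()


def library_pick_threads(query_cpu_count):
    """Stand-in for an unmodified library that sizes its pool from a CPU query."""
    return query_cpu_count()


def run_in_context(monitor, name, cpus, results):
    # Enter the context first, then let the "library" query its resources.
    monitor.enter(cpus)
    results[name] = library_pick_threads(monitor.cpu_count)


monitor = ToyMonitor()
allotments = partition(list(range(8)), [6, 2])  # assumed 8-core machine
results = {}
workers = [
    threading.Thread(target=run_in_context, args=(monitor, name, cpus, results))
    for name, cpus in zip(["lib_a", "lib_b"], allotments)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results.items()))  # [('lib_a', 6), ('lib_b', 2)]
```

Each stand-in library still behaves as if it owned every core it can see. The difference is that what it sees is its own allotment, so the two thread pools together never exceed the assumed 8 cores.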

Paper Structure

This paper contains 29 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: The speedup ratio of parallel hyperparameter tuning on a Transformer model in C++ LibTorch relative to sequential hyperparameter tuning.
  • Figure 2: Heatmap of relative execution times for CPU core partitions between two concurrent hyperparameter tuning tasks. Lighter regions indicate shorter execution times. The optimal partition is marked with a green box. Blue boxes represent partitions achievable with LibTorch APIs. Without VLCs, the optimal partition cannot be achieved.
  • Figure 3: Overview of VLCs and other virtualization techniques. Dashed boxes represent resource managers and solid boxes are the unit of management.
  • Figure 4: Programming model of VLCs. OpenMP and OpenBLAS are loaded into separate VLCs with different resource allocations; resource query system calls are interposed by the VLC Monitor.
  • Figure 5: Service VLC control flow. OpenMP is loaded into a VLC and links to the generated shim of pthreads. The calls to pthreads are redirected to the Service VLC.
  • ...and 6 more figures