VLCs: Managing Parallelism with Virtualized Libraries
Yineng Yan, William Ruys, Hochan Lee, Ian Henriksen, Arthur Peters, Sean Stephens, Bozhi You, Henrique Fingler, Martin Burtscher, Milos Gligoric, Keshav Pingali, Mattan Erez, George Biros, Christopher J. Rossbach
TL;DR
VLCs introduce library-level virtualization to partition resources among parallel libraries within a single process, without changing library code or modifying the OS. By isolating libraries in separate linker namespaces and interposing on their resource queries through a user-space VLC Monitor, VLCs provide fine-grained CPU/GPU partitioning and make it possible to run otherwise thread-unsafe calls in parallel and to support nested parallelism. The authors implement C++ (VLC++) and Python (PyVLC) prototypes, demonstrate low overhead, and report speedups of up to 2.85x on synthetic workloads, up to 1.96x for ARPACK, and 1.46x in Kokkos multi-GPU scenarios compared to MPI. The approach offers a practical, general way to compose parallel libraries with less contention and better resource utilization on modern heterogeneous systems.
Abstract
As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composing software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition in mind and assume they have exclusive access to all resources. Using such libraries concurrently can result in contention and degraded performance. Prior solutions involve modifying the libraries or the OS, which is often infeasible. We propose Virtual Library Contexts (VLCs), which are process subunits that encapsulate sets of libraries and their associated resource allocations. VLCs control the resource utilization of these libraries without modifying library code. This enables the user to partition resources between libraries to prevent contention, or to load multiple copies of the same library so that otherwise thread-unsafe code can execute in parallel within the same process. In this paper, we describe and evaluate C++ and Python prototypes of VLCs. Experiments show that VLCs enable speedups of up to 2.85x on benchmarks including applications using OpenMP, OpenBLAS, and LibTorch.
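The partitioning idea can be illustrated with a small Python sketch. This is not the VLC++ or PyVLC API, which the abstract does not show. `ToyMonitor`, `allocate`, and `cpu_count` are hypothetical names, and the sketch only models the bookkeeping: each context receives a disjoint set of CPU IDs, and a resource query made on behalf of a context is answered with that context's share rather than the whole machine.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyMonitor:
    """Hypothetical stand-in for a resource monitor (illustration only)."""

    total_cpus: list[int]
    allocations: dict[str, list[int]] = field(default_factory=dict)

    def allocate(self, context: str, count: int) -> list[int]:
        """Give `context` the first `count` CPUs not owned by another context."""
        used = {cpu for cpus in self.allocations.values() for cpu in cpus}
        free = [cpu for cpu in self.total_cpus if cpu not in used]
        if count > len(free):
            raise ValueError(f"only {len(free)} CPUs free, {count} requested")
        self.allocations[context] = free[:count]
        return self.allocations[context]

    def cpu_count(self, context: str) -> int:
        """Answer a 'how many CPUs are there?' query with the context's share."""
        return len(self.allocations[context])


# Assumed scenario: an 8-core machine shared by two library contexts.
monitor = ToyMonitor(total_cpus=list(range(8)))
blas_cpus = monitor.allocate("blas", 6)
torch_cpus = monitor.allocate("torch", 2)
```

In this toy model, a library running in the `"blas"` context that asks for the CPU count would see 6 rather than 8, so it would size its thread pool to its own partition. A real system also has to intercept the library's actual queries and enforce the allocation, which this sketch does not attempt.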
