Enabling Heterogeneous Performance Analysis for Scientific Workloads

Maksymilian Graczyk, Vincent Desbiolles, Stefan Roiser, Andrea Guerrieri

TL;DR

This paper tackles profiling challenges in heterogeneous scientific workloads by evaluating architecture-agnostic performance analysis with Adaptyst. It focuses on two eBPF-based profiling options, Uprobes and USDT, assessing their runtime overhead and deployment complexity. Using a small C benchmark on a dedicated workstation, the study reports runtime overheads of around 4.8–5.1% and notes that Uprobes incurs more system time, while USDT shows slightly higher variability. The results inform a roadmap for integrating eBPF-based profiling into Adaptyst and extending its capabilities to non-CPU devices, advancing heterogeneous performance analysis for scientific workloads.

Abstract

Heterogeneous computing integrates diverse processing elements, such as CPUs, GPUs, and FPGAs, within a single system, aiming to leverage the strengths of each architecture to optimize performance and energy consumption. In this context, efficient performance analysis plays a critical role in determining the most suitable platform for dispatching tasks, ensuring that workloads are allocated to the processing units where they can execute most effectively. Adaptyst is a novel, ongoing effort at CERN that aims to develop open-source, architecture-agnostic performance analysis for scientific workloads. This study explores the performance and implementation complexity of two built-in eBPF-based methods, Uprobes and USDT, with the aim of outlining a roadmap for future integration into Adaptyst and advancing toward heterogeneous performance analysis capabilities.

Paper Structure

This paper contains 6 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: The modular design of Adaptyst as a software-hardware co-design framework. Stateful dataflow multigraphs (SDFGs, from DaCe) are used as the IR. (a) Every module describes how a corresponding Adaptyst IR should be generated for its use case. (b) Every node is assigned to a backend module which describes how a specific system component should be modelled/profiled, how it can connect to other nodes, and (later) how a match between the component and the Adaptyst IR region should be calculated.
  • Figure 2: System versus User Breakdown over 100 warm-up runs and 1000 measurement runs. Uprobes uses more system time than USDT.