Examem: Low-Overhead Memory Instrumentation for Intelligent Memory Systems

Ashwin Poduval, Hayden Coffey, Michael Swift

TL;DR

Examem is a memory performance introspection framework based on the LLVM compiler infrastructure that statically records information about the instruction mix of the code and adds dynamic instrumentation to produce estimated memory bandwidth for an instrumented region at runtime.

Abstract

Memory performance is often the main bottleneck in modern computing systems. In recent years, researchers have attempted to scale the memory wall by leveraging new technologies such as CXL, HBM, and in- and near-memory processing. Developers optimizing for such hardware need to understand how target applications perform in order to take full advantage of these systems. Existing software and hardware performance introspection techniques are ill-suited for this purpose due to one or more of the following factors: coarse-grained measurement, inability to offer the data needed to debug key issues, high runtime overhead, and hardware dependence. The heightened integration between compute and memory in many proposed systems offers an opportunity to extend compiler support for this purpose. We have developed Examem, a memory performance introspection framework based on the LLVM compiler infrastructure. Examem supports developer-annotated regions in code, allowing for targeted instrumentation of kernels. Examem supports hardware performance counters when available, in addition to software instrumentation. It statically records information about the instruction mix of the code and adds dynamic instrumentation to produce an estimated memory bandwidth for an instrumented region at runtime. This combined approach keeps runtime overhead low while remaining accurate, with a geomean overhead under 10% and a geomean byte accuracy of 93%. Finally, our instrumentation is performed using an LLVM IR pass, which is target-agnostic, and we have applied it to four ISAs.
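
The combination of a static instruction mix with a dynamic counter can be illustrated with a minimal sketch. This is not Examem's implementation: the `BlockMix` record, the `estimate_bandwidth` function, and the example numbers are all hypothetical, and the sketch assumes that the bytes loaded and stored per execution of a block are known at compile time, so that only an execution count and an elapsed time need to be collected at runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMix:
    """Hypothetical static record: bytes moved by one execution of a block."""
    load_bytes: int
    store_bytes: int


def estimate_bandwidth(mix: BlockMix, exec_count: int, elapsed_seconds: float) -> float:
    """Estimate bytes per second for a region.

    mix             -- static per-execution byte counts (assumed known at compile time)
    exec_count      -- dynamic counter value read at the end of the region
    elapsed_seconds -- wall-clock time spent in the region
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bytes = (mix.load_bytes + mix.store_bytes) * exec_count
    return total_bytes / elapsed_seconds


# Illustrative numbers only: 16 B loaded and 8 B stored per execution,
# one million executions in half a second.
mix = BlockMix(load_bytes=16, store_bytes=8)
bandwidth = estimate_bandwidth(mix, exec_count=1_000_000, elapsed_seconds=0.5)
```

In this sketch the only runtime cost is one counter increment per block execution plus two timestamps per region, which is the intuition behind keeping overhead low.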

Paper Structure

This paper contains 29 sections, 5 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Data-intensive kernel with unequal bandwidth usage. The figure on the left is transformed into the figure on the right by the algorithm.
  • Figure 2: Overview of Examem's components.
  • Figure 3: Examem applies timing events and software counters to regions of interest.
  • Figure 4: Example post-domination set. Block C post-dominates block A; one counter covers the two blocks. Block B does not post-dominate A, and is given its own counter.
  • Figure 5: Example loop SW counter hoisting. One counter covers both for loops by scaling the instruction mixes of the basic blocks by the loop trip counts.
  • ...and 8 more figures
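
The counter hoisting described in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the `hoisted_mix` helper, the dictionary keys, and the trip counts are hypothetical, and the sketch assumes each loop's trip count is available when the mixes are combined.

```python
def hoisted_mix(loops):
    """Combine per-iteration instruction mixes into one mix per counter tick.

    loops -- list of (body_mix, trip_count) pairs, where body_mix maps a
             category name to the amount contributed by one loop iteration.
    Returns a dict with each category scaled by its loop's trip count and
    summed, so a single counter placed outside the loops covers all of them.
    """
    total = {}
    for body_mix, trip_count in loops:
        for kind, amount in body_mix.items():
            total[kind] = total.get(kind, 0) + amount * trip_count
    return total


# Two hypothetical loops that would otherwise each need their own counter.
loops = [
    ({"load_bytes": 8, "store_bytes": 0}, 100),  # first loop: 100 iterations
    ({"load_bytes": 4, "store_bytes": 4}, 50),   # second loop: 50 iterations
]
per_tick = hoisted_mix(loops)

# If the single hoisted counter reads 3 at the end of the region, the totals are:
counter_value = 3
region_totals = {kind: amount * counter_value for kind, amount in per_tick.items()}
```

The same bookkeeping applies to the post-domination case in the Figure 4 caption: blocks that always execute together can share one counter, with their static mixes summed before being multiplied by the counter value.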