Updates on the Low-Level Abstraction of Memory Access

Bernhard Manfred Gruber

TL;DR

The paper addresses the growing gap between computation and memory performance on heterogeneous hardware by presenting LLAMA, a portable C++ library that decouples data layout from algorithms via exchangeable memory mappings with zero runtime overhead. It introduces compile-time array extents, which enable stateless, memcpy-friendly views, and discusses new mappings (BitpackIntSoA, BitpackFloatSoA, ByteStreamSplit, Null) for flexible data representations. It also presents memory access instrumentation (FieldAccessCount, Heatmap) and explicit SIMD support through SimdN, loadSimd, and storeSimd, and an n-body benchmark on CPU indicates that LLAMA can reach performance parity with manually written code. The work highlights significant potential for cross-architecture portability and performance tuning while acknowledging overheads and outlining future work for broader application and platform support.
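
To make the byte-splitting idea behind a mapping such as ByteStreamSplit more concrete, the following is a minimal standalone sketch of the general technique as it is commonly understood (for example from Parquet's BYTE_STREAM_SPLIT encoding): byte k of every value is stored in stream k, so bytes with similar statistics end up adjacent. The function names `splitBytes` and `joinBytes` are hypothetical and are not part of LLAMA; the library's actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration; not LLAMA's ByteStreamSplit mapping.

// Encode: byte k of value i is stored at out[k * n + i],
// so the output consists of sizeof(float) consecutive byte streams.
std::vector<std::uint8_t> splitBytes(const std::vector<float>& values) {
    const std::size_t n = values.size();
    std::vector<std::uint8_t> out(n * sizeof(float));
    for (std::size_t i = 0; i < n; ++i) {
        std::uint8_t bytes[sizeof(float)];
        std::memcpy(bytes, &values[i], sizeof(float)); // well-defined byte access
        for (std::size_t k = 0; k < sizeof(float); ++k)
            out[k * n + i] = bytes[k];
    }
    return out;
}

// Decode: gather byte k of value i from stream k and reassemble the float.
std::vector<float> joinBytes(const std::vector<std::uint8_t>& streams) {
    const std::size_t n = streams.size() / sizeof(float);
    std::vector<float> values(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::uint8_t bytes[sizeof(float)];
        for (std::size_t k = 0; k < sizeof(float); ++k)
            bytes[k] = streams[k * n + i];
        std::memcpy(&values[i], bytes, sizeof(float));
    }
    return values;
}
```

Splitting and joining are exact inverses, so `joinBytes(splitBytes(v))` reproduces `v` bit for bit. The rearranged bytes take the same space as the input; any benefit comes from a later compression step or from access patterns that favor the split layout.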

Abstract

Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. The low-level abstraction of memory access (LLAMA) is a C++ library that provides a zero-runtime-overhead abstraction layer, underneath which memory mappings can be freely exchanged to customize data layouts, memory access and access instrumentation, focusing on multidimensional arrays of nested, structured data. After its scientific debut, several improvements and extensions have been added to LLAMA. These include compile-time array extents for zero-memory-overhead views, support for computations during memory access, new mappings for bit-packing, switching types, byte-splitting, memory access instrumentation, and explicit SIMD support. This contribution provides an overview of recent developments in the LLAMA library.
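
The decoupling described in the abstract can be illustrated with a small standalone sketch. It deliberately does not use LLAMA's API; `AoSMapping`, `SoAMapping`, `View`, and `sumField` are hypothetical names chosen for this example. The sketch assumes a flat record of three float fields and shows two points: the algorithm is written once against a view, and a mapping whose array extent is a compile-time constant carries no state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical illustration only; none of these names are LLAMA's API.
// Assumed record: three float fields (for example x, y, z).
inline constexpr std::size_t fieldCount = 3;

// Array-of-structs mapping: the fields of one record are adjacent.
template <std::size_t N> // N is the compile-time array extent
struct AoSMapping {
    static constexpr std::size_t size = N * fieldCount;
    static constexpr std::size_t slot(std::size_t i, std::size_t field) {
        return i * fieldCount + field;
    }
};

// Struct-of-arrays mapping: all values of one field are adjacent.
template <std::size_t N>
struct SoAMapping {
    static constexpr std::size_t size = N * fieldCount;
    static constexpr std::size_t slot(std::size_t i, std::size_t field) {
        return field * N + i;
    }
};

// A view owns the storage and routes every access through the mapping.
template <typename Mapping>
struct View {
    std::array<float, Mapping::size> storage{};
    float& operator()(std::size_t i, std::size_t field) {
        return storage[Mapping::slot(i, field)];
    }
};

// The algorithm sees only the view interface, never the layout.
template <typename V>
float sumField(V& view, std::size_t n, std::size_t field) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += view(i, field);
    return sum;
}

// With the extent as a template parameter, the mapping has no data members.
static_assert(std::is_empty_v<AoSMapping<8>>);
static_assert(std::is_empty_v<SoAMapping<8>>);
```

Switching from `View<AoSMapping<4>>` to `View<SoAMapping<4>>` changes where each value lives in memory while `sumField` stays untouched. Since the slot computation is a `constexpr` function of template parameters, a compiler can typically inline it completely, which is the intuition behind the zero-overhead claim.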

Paper Structure

This paper contains 6 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Conceptual overview of LLAMA.
  • Figure 2: A SIMD version of the n-body update routine from the original LLAMA paper [llama_paper], using std::fixed_size_simd as the SIMD technology, as proposed for C++26 [std_simd].
  • Figure 3: Benchmark of the CPU LLAMA n-body with a selection of popular mappings against various manually written versions.