Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR

Jude Haris, Nicolas Bohm Agostini, Antonino Tumeo, David Kaeli, José Cano

TL;DR

Efficient data movement and host-accelerator coordination are essential as custom accelerators become prevalent. AXI4MLIR extends the MLIR framework to generate host-driver code that offloads linear algebra operations to AXI4-Stream-based accelerators, and this work adds three data-transfer optimizations to hide data movement costs: DMA-based data allocation, data coalescing, and software pipelining with double-buffering. They are motivated by a MatMul case study showing a baseline accelerator utilization of $<10\%$ and a host bottleneck in heap-to-DMA transfers. These automated optimizations reduce manual engineering effort and improve accelerator utilization and total latency, enabling more efficient execution of matrix and linear algebra workloads on heterogeneous systems.

Abstract

As custom hardware accelerators become more prevalent, it becomes increasingly important to automatically generate efficient host-driver code that can fully leverage the capabilities of these accelerators. This approach saves time and reduces the likelihood of errors that can occur during manual implementation. AXI4MLIR extends the MLIR compiler framework to generate host-driver code for custom accelerators for linear algebra problems. By leveraging specific compiler optimizations, we can further increase accelerator utilization. In this work we offer two key observations through a MatMul accelerator case study. First, the accelerator's compute core utilization is less than 10%, and second, the critical latency bottleneck is caused by copying data between the heap and memory-mapped DMA buffers. We identify a set of missing host code optimizations to improve the under-utilization and the latency bottleneck. Therefore, we propose three key host-code data-movement-related optimizations, extending AXI4MLIR. The optimizations provide DMA-based data allocation, coalescing of DMA transfers, and pipelining of the accelerator's load, compute, and store stages.
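
To make the pipelining idea concrete, the following is a minimal timing sketch, not the AXI4MLIR implementation. It assumes a hypothetical accelerator that processes a MatMul as a sequence of identical tiles, each going through a load, a compute, and a store stage. It further assumes that the three stages can overlap across tiles and that each of the input and output sides has `n_buffers` buffers. The function names and the cycle counts used in the usage note are illustrative assumptions and are not taken from the paper.

```python
def sequential_cycles(n_tiles, load, compute, store):
    """Total cycles when every tile runs load -> compute -> store with no overlap."""
    return n_tiles * (load + compute + store)


def pipelined_cycles(n_tiles, load, compute, store, n_buffers=2):
    """Total cycles when stages overlap across tiles (hypothetical model).

    Assumptions: one load, one compute, and one store can be in flight at the
    same time; an input buffer is free again once the tile that used it has
    been computed; an output buffer is free again once its tile is stored.
    """
    load_done = [0] * n_tiles
    compute_done = [0] * n_tiles
    store_done = [0] * n_tiles
    for i in range(n_tiles):
        # The load stage is serial and needs a free input buffer.
        prev_load = load_done[i - 1] if i >= 1 else 0
        in_free = compute_done[i - n_buffers] if i >= n_buffers else 0
        load_done[i] = max(prev_load, in_free) + load

        # Compute needs its input loaded and a free output buffer.
        prev_compute = compute_done[i - 1] if i >= 1 else 0
        out_free = store_done[i - n_buffers] if i >= n_buffers else 0
        compute_done[i] = max(load_done[i], prev_compute, out_free) + compute

        # The store stage is serial and needs the tile's result.
        prev_store = store_done[i - 1] if i >= 1 else 0
        store_done[i] = max(compute_done[i], prev_store) + store
    return store_done[-1] if n_tiles else 0


def utilization(n_tiles, compute, total_cycles):
    """Fraction of total cycles in which the compute stage is active."""
    return n_tiles * compute / total_cycles if total_cycles else 0.0
```

For example, with made-up values of 4 tiles, 10 cycles per load, 2 per compute, and 10 per store, `sequential_cycles(4, 10, 2, 10)` and `pipelined_cycles(4, 10, 2, 10)` can be compared directly, and `utilization` shows how the compute-active fraction changes. In this model the data-transfer stages still dominate the pipelined total, which is consistent with the abstract's point that reducing copy cost (DMA-based allocation, coalescing) matters alongside pipelining.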

Paper Structure

This paper contains 6 sections and 2 figures.

Figures (2)

  • Figure 1: Breakdown of clock cycles spent inside a simple MatMul accelerator. Red segments (Compute C %) represent the time during which the accelerator's processing elements are active.
  • Figure 2: Pseudo-MLIR code of a tiled MatMul problem showcasing the baseline and the proposed extensions.