Gem5-AcceSys: Enabling System-Level Exploration of Standard Interconnects for Novel Accelerators

Qunyou Liu, Marina Zapater, David Atienza

TL;DR

Gem5-AcceSys extends the Gem5 simulator to enable system-level co-design and evaluation of standard interconnects (e.g., PCIe) and configurable memory hierarchies for accelerator systems. Using a transformer-tuned GEMM accelerator (MatrixFlow) as a case study, the framework analyzes PCIe bandwidth, memory types, and address translation, and distinguishes GEMM and Non-GEMM workload contributions. Key findings show that optimized interconnects can reach up to 80% of device-side memory performance, with device-side memory offering substantial gains but introducing NUMA-related trade-offs for Non-GEMM tasks. The work provides actionable guidance for balancing performance and cost in next-generation accelerators, particularly for transformer workloads.

Abstract

The growing demand for efficient, high-performance processing in machine learning (ML) and image processing has made hardware accelerators, such as GPUs and Data Streaming Accelerators (DSAs), increasingly essential. These accelerators speed up ML and image processing tasks by offloading computation from the CPU to dedicated hardware, and they rely on interconnects for efficient data transfer, which makes interconnect design crucial for system-level performance. This paper introduces Gem5-AcceSys, a framework for system-level exploration of standard interconnects and configurable memory hierarchies. Using a matrix multiplication accelerator tailored for transformer workloads as a case study, we evaluate PCIe performance across diverse memory types (DDR4, DDR5, GDDR6, HBM2) and configurations, including host-side and device-side memory. Our findings demonstrate that optimized interconnects can achieve up to 80% of device-side memory performance and, in some scenarios, even surpass it. These results offer actionable insights for system architects, enabling a balanced approach to performance and cost in next-generation accelerator design.

Paper Structure

This paper contains 25 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Design Framework Architecture
  • Figure 2: Roofline Model of the Accelerator System
  • Figure 3: Performance (Execution time) for Matrix Size 2048 under varying per-lane bandwidth and number of lanes
  • Figure 4: Execution time under different packet sizes for different PCIe bandwidths
  • Figure 5: Impact of DRAM type and location
  • ...and 4 more figures