Mestra: Exploring Migration on Virtualized CGRAs

Agamemnon Kyriazis, Panagiotis Miliadis, Dimitris Theodoropoulos, Nectarios Koziris, Dionisios Pnevmatikatos

Abstract

As modern Coarse Grain Reconfigurable Arrays (CGRAs) grow in size, efficient utilization of the available fabric by a single application becomes increasingly difficult. Existing CGRA mappers either fail to utilize the available fabric or rely on rigid static code transformations with limited adaptability. Multi-tenant CGRAs have emerged as a promising solution to increase hardware utilization, but current attempts fail to address key challenges such as fabric fragmentation and live migration. To address this gap, we present Mestra, an end-to-end system for CGRA multi-tenancy that supports dynamic scheduling and resource allocation in a shared environment. Mestra addresses fabric fragmentation caused by kernels completing out of order by supporting both stateless and stateful live kernel migration as a de-fragmentation mechanism. We assess our solution on an Alveo-U280 data-center-grade FPGA card, reporting area, frequency, and power. Performance is evaluated using routines from the PolyBench benchmark suite and kernels derived from common machine learning operators. Results show that spatial sharing of the available fabric across multiple users improves workload makespan by up to 70.48%, while live kernel migration reduces tail latency on fragmented layouts by up to 29.60%. The custom tightly coupled controller and read-back paths required for virtualization and stateful migration introduce a LUT cost of 0.13% per region. Our evaluation reveals that multi-tenancy is important for efficient CGRA utilization, and live kernel migration can further improve performance by recovering fragmented space with minimal hardware cost.

Paper Structure

This paper contains 25 sections, 13 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Heterogeneous grid of PEs arranged on a mesh point-to-point network, shown here for a region of dimensions $3\times5$. Regions are modular and provide full 2D flexibility.
  • Figure 2: FSM of our tightly coupled controller. States are shown in yellow and commands in blue. We define a minimal set of both states and commands, prioritizing utility and simplicity. A command is accepted only in its valid state, raising an Illegal-Command flag otherwise.
  • Figure 3: State-critical elements for each PE type.
  • Figure 4: System-level organization of Mestra. The hypervisor receives requests from multiple users and allocates resources (color-coded). Requests are communicated over PCIe to the Shell, which manages the underlying vCGRA regions. Kernels execute in parallel on different reconfigurable vCGRA regions. A kernel may occupy more than one region, in which case those regions are merged into one unified region.
  • Figure 5: A tiled multi-tenant architecture enables concurrent execution on disjoint regions and can reduce $t_{wait}$ by overlapping the $t_{exec}$ of independent kernels, as illustrated. Filled red boxes denote hypervisor-induced delays, such as PCIe host-to-device communication, copying kernel data buffers between host and device memory, and dynamic resource allocation. These intervals are mutually exclusive and cannot overlap with one another.
  • ...and 5 more figures
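
The benefit Figure 5 illustrates can be sketched with a toy makespan model. The snippet below is illustrative only, not Mestra's scheduler: it compares running independent kernels serially on a single region against spatially sharing disjoint regions, where kernel $t_{exec}$ values overlap but the hypervisor-induced intervals remain mutually exclusive. All durations are hypothetical values chosen for illustration.

```python
def serial_makespan(exec_times, hv_delay):
    # Single region: kernels run back to back, and the hypervisor-induced
    # delay (PCIe transfers, buffer copies, allocation) precedes every launch.
    return sum(hv_delay + t for t in exec_times)

def shared_makespan(exec_times, hv_delay, regions):
    # Spatial sharing: kernels execute concurrently on disjoint regions,
    # but hypervisor-induced intervals are serialized (mutually exclusive).
    finish = [0.0] * regions  # per-region completion time
    hv_free = 0.0             # hypervisor is busy until this time
    for t in exec_times:
        r = min(range(regions), key=lambda i: finish[i])  # earliest-free region
        start = max(finish[r], hv_free) + hv_delay        # wait for region and hypervisor
        hv_free = start                                   # next hypervisor interval starts after this one
        finish[r] = start + t
    return max(finish)

if __name__ == "__main__":
    kernels = [8.0, 6.0, 7.0, 5.0]  # hypothetical t_exec per kernel
    print(serial_makespan(kernels, hv_delay=1.0))             # → 30.0
    print(shared_makespan(kernels, hv_delay=1.0, regions=2))  # → 16.0
```

Even in this simplified model, overlapping $t_{exec}$ across two regions roughly halves the makespan, while the serialized hypervisor intervals cap the achievable speedup — the effect the filled red boxes in Figure 5 depict.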