ACiS: Complex Processing in the Switch Fabric

Pouya Haghi, Anqi Guo, Tong Geng, Anthony Skjellum, Martin Herbordt

TL;DR

ACiS proposes a general in-switch computing framework that extends switch fabric with a CGRA-based accelerator to offload and fuse HPC computations inside the network. By defining a progression of processing types from simple collectives to user-defined operations, look-aside stateful processing, and fused map–collective patterns, the approach enables transparent MPI acceleration without compromising existing datapath performance. Hardware plugins and a modular PISA integration support scalable, programmable in-switch computation, while software tooling provides MPI transparency, a source-to-source translator for fused collectives, and a usage database to guide deployment. Experimental results across indirect/direct networks, NAS benchmarks, and graph neural networks demonstrate substantial latency reductions and scalability improvements, illustrating the practical impact of shifting computation into the switch fabric for HPC workloads.

Abstract

For the last three decades a core use of FPGAs has been for processing communication: FPGA-based SmartNICs are in widespread use from the datacenter to IoT. Augmenting switches with FPGAs, however, has been less studied, but has numerous advantages built around the processing being moved from the edge of the network to the center. Communication switches have previously been augmented to process collectives, e.g., IBM BlueGene and Mellanox SHArP, but the support has been limited to a small set of predefined scalar operations and datatypes. Here we present ACiS, a framework and taxonomy for Advanced Computing in the Switch that unifies and expands our previous work in this area. In addition to fixed scalar collectives (Type 1), we propose three more types of in-switch application processing: (Type 2) User-defined operations and types, including data structures; (Type 3) Look-aside operations that have state within the operation and can have loops; and (Type 4) Fused collectives built by fusing multiple existing collectives or collectives with map computations. ACiS is supported in hardware with modular switch extensions including a CGRA architecture. Software support for ACiS includes evaluation and translation of relevant parts of user programs, compilation of user specifications into control flow graphs, and mapping the graphs into switch hardware. The overall goal is the transparent acceleration of HPC applications encapsulated within an MPI implementation.
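
As an illustration of what a Type 2 operation could look like, the sketch below simulates a reduction with a user-defined operator over a small data structure, combined level by level as a tree of switches might do. It is a pure-Python model under stated assumptions, not the ACiS implementation: the names `MinLoc`, `min_loc`, and `switch_tree_reduce`, and the `radix` parameter, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class MinLoc:
    """Hypothetical user-defined datatype: a value plus the rank that owns it."""
    value: float
    rank: int


def min_loc(a: MinLoc, b: MinLoc) -> MinLoc:
    """User-defined reduction operator (associative and commutative).

    Keeps the smaller value; ties are broken by the lower rank.
    """
    if (b.value, b.rank) < (a.value, a.rank):
        return b
    return a


def switch_tree_reduce(contributions: List[T],
                       op: Callable[[T, T], T],
                       radix: int = 4) -> T:
    """Simulate a reduction performed inside a tree of switches.

    Each simulated switch combines up to `radix` inputs with `op` and
    forwards one partial result to the next level. This assumes `op` is
    associative, since the grouping depends on the tree shape.
    """
    if not contributions:
        raise ValueError("at least one contribution is required")
    if radix < 2:
        raise ValueError("radix must be at least 2")
    level = list(contributions)
    while len(level) > 1:
        next_level = []
        for start in range(0, len(level), radix):
            group = level[start:start + radix]
            partial = group[0]
            for item in group[1:]:
                partial = op(partial, item)
            next_level.append(partial)
        level = next_level
    return level[0]


# One contribution per simulated rank.
values = [3.5, 1.25, 7.0, 1.25, 9.0, 4.0]
result = switch_tree_reduce(
    [MinLoc(v, r) for r, v in enumerate(values)], min_loc, radix=2
)
print(result)
```

The operator is passed in as an ordinary function, which mirrors the idea that the operation and datatype are supplied by the user rather than chosen from a fixed set of scalar operations.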

Paper Structure

This paper contains 12 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The well-known Protocol Independent Switch Architecture (PISA) enhanced with ACiS accelerator, a composable plugin to a packet processing pipeline (one pipe shown). White blocks are proposed and grey blocks are in existing switches.
  • Figure 2: The proposed CGRA architecture used in Figure 1, with three SIMD processing units (SPUs) in a deep pipeline. It is packaged with an AXI interface to facilitate integration with switch pipelines.
  • Figure 3: ACiS vs MPI CPU cluster (SKX) execution times for 32, 64, and 128 nodes: (a) osu_allgather, (b) osu_allreduce, (c) osu_bcast, and (d) osu_gather.
  • Figure 4: Application performance and scalability comparison of GCN on a baseline CPU cluster (SKX) vs. ACiS.
  • Figure 5: Latency comparison of Allgather_op_Allgatherv in MPI4py and ACiS. Op is prefix sum. The X-axis shows the message size in bytes used for Allgathers and the Y-axis shows the latency in milliseconds.
  • ...and 1 more figure
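
To make the fused pattern named in the Figure 5 caption concrete, the sketch below models one plausible reading of Allgather_op_Allgatherv: an Allgather of per-rank counts, a prefix sum that turns counts into displacements, and an Allgatherv of the variable-length data. This reading is an assumption, and the code is a pure-Python simulation that does not use MPI or mpi4py. The function names `unfused_pipeline` and `fused_in_switch` are hypothetical.

```python
from itertools import accumulate
from typing import List


def unfused_pipeline(local_blocks: List[List[int]]) -> List[List[int]]:
    """Three separate steps, as a host-side program might issue them."""
    # Step 1: Allgather of counts, so every rank learns all block sizes.
    counts = [len(block) for block in local_blocks]
    # Step 2: the op, an exclusive prefix sum that yields displacements.
    displs = [0] + list(accumulate(counts))[:-1]
    # Step 3: Allgatherv, where every rank places each block at its offset.
    total = sum(counts)
    results = []
    for _ in local_blocks:
        recv = [0] * total
        for count, displ, block in zip(counts, displs, local_blocks):
            recv[displ:displ + count] = block
        results.append(recv)
    return results


def fused_in_switch(local_blocks: List[List[int]]) -> List[List[int]]:
    """One pass: a running offset plays the role of the prefix sum.

    The simulated switch keeps the offset as state while blocks stream
    through, so counts never travel back to the ranks.
    """
    offset = 0
    gathered: List[int] = []
    for block in local_blocks:
        # The current offset equals the exclusive prefix sum of the counts.
        assert offset == len(gathered)
        gathered.extend(block)
        offset += len(block)
    # Every rank receives the same concatenated buffer.
    return [list(gathered) for _ in local_blocks]


blocks = [[1, 2], [3], [], [4, 5, 6]]
print(fused_in_switch(blocks)[0])
```

Both functions return the same per-rank buffers. The difference this sketch tries to show is structural: the fused version removes the intermediate round trip in which counts are gathered and displacements are computed on every rank.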