Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

Gianna Paulin, Paul Scheffler, Thomas Benz, Matheus Cavalcante, Tim Fischer, Manuel Eggimann, Yichao Zhang, Nils Wistoff, Luca Bertaccini, Luca Colagrande, Gianmarco Ottavi, Frank K. Gürkaynak, Davide Rossi, Luca Benini

TL;DR

Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow SIMD FP data, achieves leading-edge utilization on stencils, sparse-dense, and sparse-sparse computations.

Abstract

We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply.

Paper Structure

This paper contains 3 sections, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Architecture of our dual-chiplet system. (c, d): chiplets connected by a 512-bit D2D link (a, b). One chiplet (c, d) contains six groups, peripherals, a 64-bit Linux-capable host, and 1.5 MiB of SPM. One group (e) contains four clusters (f), each containing eight compute coreplexes (g-i).
  • Figure 2: Module photograph with dimensions.
  • Figure 3: Architecture of the cooperating sparsity-capable SUs in each worker core.
  • Figure 4: Hierarchical area breakdown of interposer and chiplets.
  • Figure 5: Sparse-dense dot product assembly without and with our SUs; our ISA extensions enable continuous FMA execution.
  • ...and 2 more figures