Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12nm FinFET
Paul Scheffler, Thomas Benz, Viviane Potocnik, Tim Fischer, Luca Colagrande, Nils Wistoff, Yichao Zhang, Luca Bertaccini, Gianmarco Ottavi, Manuel Eggimann, Matheus Cavalcante, Gianna Paulin, Frank K. Gürkaynak, Davide Rossi, Luca Benini
TL;DR
Occamy addresses the challenge of efficiently executing heterogeneous dense and sparse computations by introducing a 2.5D, dual-chiplet RISC-V system with 432 cores, dual-HBM2E memory, and a latency-tolerant interconnect. The architecture relies on sparsity-capable streaming units and explicit, tile-based data movement to achieve high FPU utilization across dense, sparse-dense, and sparse-sparse workloads, with open-source RTL available. Silicon implementation in 12 nm FinFET chiplets on a 65 nm interposer demonstrates state-of-the-art compute density and energy efficiency, including up to 83% FPU utilization on stencil codes and 42% on sparse-dense linear algebra (LA), as well as 187 GCOMP/s at 17.4 GCOMP/s/W on sparse-sparse LA, and strong ML inference performance on GPT-J and graph layers. Overall, Occamy combines high dense efficiency, flexible sparse support, and open hardware access, offering a practical, scalable path for heterogeneous compute workloads in the HPC and ML domains.
Abstract
ML and HPC applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing CPUs and GPUs struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy's compute chiplets in 12 nm FinFET, and its passive interposer, Hedwig, in a 65 nm node. On dense linear algebra (LA), Occamy achieves a competitive FPU utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm², leading state-of-the-art (SoA) processors by 1.7x and 1.2x, respectively. On sparse-dense LA, it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm², surpassing the SoA by 5.2x and 11x, respectively. On sparse-sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm². Finally, we reach up to 75% and 54% FPU utilization on dense (LLM) and graph-sparse (GCN) ML inference workloads, respectively. Occamy's RTL is freely available under a permissive open-source license.
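To make the distinction between the sparse-dense and sparse-sparse workload classes concrete, the following is a minimal software sketch of the two access patterns. It is a generic illustration, not Occamy's implementation: the function names are hypothetical, the index/value vector layout is an assumption, and reading GCOMP/s as a rate of index comparisons is likewise an assumption made here. Sparse-dense kernels gather dense operands through stored indices, while sparse-sparse kernels compare the indices of two operands to find matching positions.

```python
def sparse_dense_dot(idx, val, dense):
    """Dot product of a sparse vector (idx, val) with a dense vector.

    Each stored index selects one element of the dense operand
    (an indirect, gather-style access).
    """
    return sum(v * dense[i] for i, v in zip(idx, val))


def sparse_sparse_dot(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors whose indices are sorted ascending.

    Returns (result, comparisons), where comparisons counts the
    index comparisons performed while intersecting the two index lists.
    """
    i = j = 0
    acc = 0.0
    comparisons = 0
    while i < len(a_idx) and j < len(b_idx):
        comparisons += 1
        if a_idx[i] == b_idx[j]:
            # Matching index: both operands are nonzero here.
            acc += a_val[i] * b_val[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return acc, comparisons


if __name__ == "__main__":
    print(sparse_dense_dot([0, 3], [1.0, 2.0], [10.0, 0.0, 0.0, 5.0]))
    print(sparse_sparse_dot([0, 3, 5], [1.0, 2.0, 3.0],
                            [3, 4, 5], [4.0, 5.0, 6.0]))
```

In the sparse-sparse case, the amount of useful floating-point work depends on how often the indices match, which is one plausible reason to report throughput in comparisons rather than FLOPs for that workload class.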
