Osiris: A Systolic Approach to Accelerating Fully Homomorphic Encryption
Austin Ebel, Brandon Reagen
TL;DR
Osiris is a scalable systolic architecture for accelerating fully homomorphic encryption (FHE). It decomposes FHE workloads into simple kernel units that are connected through limb interleaving. The key innovations are a 2D BConv-based accelerator, interleaved limb processing, and a giant-step centric (GSC) dataflow that maps state-of-the-art matrix-vector methods (BSGS with double hoisting) to hardware while enabling on-chip reuse and reducing off-chip traffic. The architecture achieves state-of-the-art performance on standard benchmarks (e.g., bootstrapping and ResNet-20 inference) at 1 TB/s of bandwidth, with near-linear gains as bandwidth and compute scale. These results suggest that co-designing the dataflow, memory tiling, and on-chip generation techniques can narrow the gap between FHE theory and real-world deployment.
Abstract
In this paper we show how fully homomorphic encryption (FHE) can be accelerated using a systolic architecture. We begin by analyzing FHE algorithms and then develop systolic or systolic-esque units for each major kernel. Connecting units is challenging due to the different data access and computational patterns of the kernels. We overcome this by proposing a new data tiling technique that we name limb interleaving. Limb interleaving creates a common data input/output pattern across all kernels that allows the entire architecture, named Osiris, to operate in lockstep. Osiris is capable of processing key-switches, bootstrapping, and full neural network inferences with high utilization across a range of FHE parameters. To achieve high performance, we propose a new giant-step centric (GSC) dataflow that efficiently maps state-of-the-art FHE matrix-vector product algorithms onto Osiris by optimizing for reuse and parallelism. Our evaluation of Osiris shows it outperforms the prior state-of-the-art accelerator on all standard benchmarks.
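For readers unfamiliar with the matrix-vector product algorithms mentioned above, the following is a minimal plaintext sketch of the baby-step giant-step (BSGS) diagonal method. It is not Osiris's GSC dataflow and does not model double hoisting, ciphertexts, or limbs. NumPy arrays stand in for ciphertext slots, `np.roll` stands in for a homomorphic rotation, and the function names (`rot`, `diagonal`, `bsgs_matvec`) and the split parameter `n1` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rot(x, k):
    """Cyclic left rotation by k slots (stand-in for a homomorphic rotation)."""
    return np.roll(x, -k)


def diagonal(M, k):
    """k-th generalized diagonal of a square matrix: d[i] = M[i, (i + k) mod n]."""
    n = M.shape[0]
    i = np.arange(n)
    return M[i, (i + k) % n]


def bsgs_matvec(M, v, n1):
    """Compute M @ v with the BSGS diagonal method (plaintext simulation).

    The n diagonals are split as k = g * n1 + b, with baby step b in [0, n1)
    and giant step g in [0, n / n1). Only n1 baby-step rotations of v and
    n / n1 giant-step rotations of partial sums are needed, instead of n
    rotations in the naive diagonal method.
    """
    n = M.shape[0]
    assert M.shape == (n, n) and v.shape == (n,)
    assert n % n1 == 0, "assumption of this sketch: n1 divides n"
    n2 = n // n1
    dtype = np.result_type(M, v)

    # Baby steps: rotate the input once per b and reuse across all giant steps.
    baby = [rot(v, b) for b in range(n1)]

    out = np.zeros(n, dtype=dtype)
    for g in range(n2):
        inner = np.zeros(n, dtype=dtype)
        for b in range(n1):
            # Diagonals are plaintext, so they can be pre-rotated offline.
            d = rot(diagonal(M, g * n1 + b), -g * n1)
            inner += d * baby[b]
        # Giant step: one rotation per partial sum.
        out += rot(inner, g * n1)
    return out


# Small self-check against an ordinary matrix-vector product.
M = np.arange(64).reshape(8, 8)
v = np.arange(8)
assert np.array_equal(bsgs_matvec(M, v, n1=4), M @ v)
```

The sketch only illustrates why the giant-step loop is a natural unit of work: each giant step consumes the same set of baby-step rotations, so those rotations can be computed once and reused.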
