Osiris: A Systolic Approach to Accelerating Fully Homomorphic Encryption

Austin Ebel, Brandon Reagen

TL;DR

Osiris is a scalable systolic architecture for accelerating fully homomorphic encryption (FHE). It decomposes FHE workloads into simple kernel units connected via limb interleaving. The key innovations are a 2D BConv-based accelerator, interleaved limb processing, and a giant-step centric (GSC) dataflow that maps state-of-the-art matrix-vector methods (BSGS with double hoisting) to hardware while enabling on-chip reuse and reducing off-chip traffic. The architecture achieves state-of-the-art performance on standard benchmarks (e.g., bootstrapping and ResNet-20 inference) at 1 TB/s bandwidth, with near-linear gains as bandwidth and compute scale, demonstrating the practicality of high-throughput confidential computing. These results highlight the potential of carefully co-designed dataflow, memory tiling, and on-chip generation techniques to bridge the gap between FHE theory and real-world deployment.

Abstract

In this paper we show how fully homomorphic encryption (FHE) can be accelerated using a systolic architecture. We begin by analyzing FHE algorithms and then develop systolic or systolic-esque units for each major kernel. Connecting units is challenging due to the different data access and computational patterns of the kernels. We overcome this by proposing a new data tiling technique that we name limb interleaving. Limb interleaving creates a common data input/output pattern across all kernels that allows the entire architecture, named Osiris, to operate in lockstep. Osiris is capable of processing key-switches, bootstrapping, and full neural network inferences with high utilization across a range of FHE parameters. To achieve high performance, we propose a new giant-step centric (GSC) dataflow that efficiently maps state-of-the-art FHE matrix-vector product algorithms onto Osiris by optimizing for reuse and parallelism. Our evaluation of Osiris shows it outperforms the prior state-of-the-art accelerator on all standard benchmarks.

Paper Structure

This paper contains 17 sections, 3 equations, 13 figures, 5 tables, 2 algorithms.

Figures (13)

  • Figure 1: Visualizing how the BSGS algorithm reduces the number of ciphertext rotations in matrix-vector products.
  • Figure 2: Matrix representation of a polynomial ($N=16$, $\ell=3$) in natural order. Element $i_o$ indicates that the coefficient at index $i$ within its limb is in the $o$’th set of coefficients to be processed by a hardware unit.
  • Figure 3: A $4$-parallel MDC unit (numbered boxes indicate buffer length) performing an INTT on the first limb of the example polynomial. The same input ordering shown above is compactly described by the $i_o$ notation. C$0$ denotes cycle $0$.
  • Figure 4: Visualizing the interleaved input and output order of the $4$-parallel MDC unit introduced in Figure 3, using the $i_o$ notation from the running-example section.
  • Figure 5: $\mathsf{BConv}$ input ordering in (a) $\mathsf{Osiris}$ and (b) prior FHE accelerators, using the same example polynomial from the running-example section. Highlighted elements indicate coefficients available after four cycles. Note that in $\mathsf{Osiris}$, we can begin the $\mathsf{BConv}$ operation immediately since rows of a base table are multiplied with columns of the input polynomial.
  • ...and 8 more figures
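
The rotation savings that Figure 1 attributes to BSGS can be illustrated with a small plaintext simulation. This is a minimal sketch, not the paper's implementation: it works on ordinary Python lists rather than ciphertexts, it does not model double hoisting or the GSC dataflow, and the helper names (`rot`, `diagonals`, `matvec_naive`, `matvec_bsgs`) and the baby-step count `n1` are assumptions made for illustration. It counts only the rotations that would be applied to encrypted data; rotations of the matrix diagonals are treated as free plaintext preprocessing.

```python
def rot(x, k):
    """Cyclic left rotation: result[i] = x[(i + k) % n]."""
    n = len(x)
    return [x[(i + k) % n] for i in range(n)]


def diagonals(M):
    """Generalized diagonals: diag_k[i] = M[i][(i + k) % n]."""
    n = len(M)
    return [[M[i][(i + k) % n] for i in range(n)] for k in range(n)]


def matvec_naive(M, v):
    """Diagonal method: y = sum_k diag_k * rot(v, k); one rotation per k > 0."""
    n = len(v)
    diags = diagonals(M)
    y = [0] * n
    rotations = 0
    for k in range(n):
        rv = v
        if k:
            rv = rot(v, k)  # would be a ciphertext rotation
            rotations += 1
        y = [a + d * r for a, d, r in zip(y, diags[k], rv)]
    return y, rotations


def matvec_bsgs(M, v, n1):
    """BSGS: write k = g * n1 + b, reuse n1 baby-step rotations of v."""
    n = len(v)
    assert n % n1 == 0, "this sketch assumes n1 divides n"
    n2 = n // n1
    diags = diagonals(M)
    rotations = 0

    # Baby steps: rot(v, b) for b = 0 .. n1 - 1, computed once and reused.
    baby = [v]
    for b in range(1, n1):
        baby.append(rot(v, b))
        rotations += 1

    y = [0] * n
    for g in range(n2):
        shift = g * n1
        acc = [0] * n
        for b in range(n1):
            # Pre-rotating the plaintext diagonal is not counted as a rotation.
            d = rot(diags[shift + b], -shift)
            acc = [a + x * w for a, x, w in zip(acc, d, baby[b])]
        if g:
            acc = rot(acc, shift)  # giant step: one rotation per g > 0
            rotations += 1
        y = [a + t for a, t in zip(y, acc)]
    return y, rotations


n = 16
M = [[(3 * i + 5 * j + 1) % 7 for j in range(n)] for i in range(n)]
v = list(range(1, n + 1))
expected = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

y_naive, r_naive = matvec_naive(M, v)
y_bsgs, r_bsgs = matvec_bsgs(M, v, n1=4)
print(y_naive == expected, y_bsgs == expected, r_naive, r_bsgs)
```

With $n = 16$ and `n1 = 4`, the naive diagonal method uses $n - 1 = 15$ rotations, while the BSGS variant uses $(n_1 - 1) + (n_2 - 1) = 6$, where $n_2 = n / n_1$.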