FAME: FPGA Acceleration of Secure Matrix Multiplication with Homomorphic Encryption

Zhihan Xu, Rajgopal Kannan, Viktor K. Prasanna

TL;DR

FAME tackles the prohibitive runtime of secure matrix multiplication (MM) under Homomorphic Encryption (HE) by introducing a memory-optimized Homomorphic Linear Transformation (HLT) datapath and an FPGA accelerator built on it. The authors first develop a cost model to estimate on-chip memory requirements, then design MO-HLT, which fuses sub-operations and generates limbs on the fly, reducing both off-chip traffic and on-chip memory demand. Building on this, FAME implements a configurable FPGA architecture with modular ALU arrays, a permutation circuit, and a multi-banked scratchpad, achieving an average speedup of 221x over state-of-the-art CPU-based implementations. The work demonstrates practical, scalable fully encrypted MM workflows and points to future extensions like block MM and shape-specific optimizations.

Abstract

Homomorphic Encryption (HE) enables secure computation on encrypted data, addressing privacy concerns in cloud computing. However, the high computational cost of HE operations, particularly matrix multiplication (MM), remains a major barrier to its practical deployment. Accelerating homomorphic encrypted MM (HE MM) is therefore crucial for applications such as privacy-preserving machine learning. In this paper, we present a bandwidth-efficient FPGA implementation of HE MM. We first develop a cost model to evaluate the on-chip memory requirements for a given set of HE parameters and input matrix sizes. Our analysis shows that optimizing on-chip memory usage is critical for scalable and efficient HE MM. To this end, we design a novel datapath for Homomorphic Linear Transformation (HLT), the primary bottleneck in HE MM. The proposed datapath significantly reduces off-chip memory traffic and on-chip memory demand by enabling fine-grained data reuse. Leveraging this datapath, we introduce FAME, the first FPGA-based accelerator specifically tailored for HE MM. FAME supports arbitrary matrix shapes and is configurable across a wide range of HE parameter sets. We implement FAME on an Alveo U280 FPGA and evaluate its performance across diverse matrix sizes and shapes. Experimental results show that FAME achieves an average speedup of 221x over state-of-the-art CPU-based implementations, demonstrating its scalability and practicality for large-scale consecutive HE MM and real-world workloads.
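
The abstract names HLT as the primary bottleneck in HE MM. As a rough illustration of why it is rotation-heavy, the sketch below applies a linear transformation in plaintext using the standard diagonal method, in which each generalized diagonal of the matrix is multiplied element-wise with a rotated copy of the input vector. This is an assumption about the general technique, not the paper's datapath: the function name `diagonal_linear_transform` is hypothetical, NumPy arrays stand in for ciphertext slots, and `np.roll` stands in for a homomorphic rotation.

```python
import numpy as np

def diagonal_linear_transform(matrix, vector):
    """Compute matrix @ vector using only rotations and element-wise products.

    Plaintext stand-in for a rotation-based linear transformation;
    no encryption is performed here.
    """
    n = vector.shape[0]
    idx = np.arange(n)
    result = np.zeros(n, dtype=np.result_type(matrix, vector))
    for k in range(n):
        # k-th generalized diagonal: diagonal[i] = matrix[i, (i + k) % n]
        diagonal = matrix[idx, (idx + k) % n]
        # Left rotation by k slots: rotated[i] = vector[(i + k) % n]
        rotated = np.roll(vector, -k)
        # Element-wise multiply and accumulate
        result += diagonal * rotated
    return result

if __name__ == "__main__":
    m = np.arange(16).reshape(4, 4)
    v = np.array([1, 2, 3, 4])
    print(diagonal_linear_transform(m, v))
    print(m @ v)  # reference result for comparison
```

In this plaintext sketch every nonzero diagonal costs one rotation and one element-wise product; under encryption each of those rotations is a far more expensive operation, which is consistent with the abstract's focus on data reuse within HLT.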

Paper Structure

This paper contains 25 sections, 13 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: For a practical parameter set (i.e., Set-C): (A) Baseline $\mathsf{HLT}$ design with coarse-grained rotation loop and full ciphertext-level datapath, incurring high DRAM traffic. (B) Proposed memory-optimized $\mathsf{HLT}$ (MO-HLT), an architecture-algorithm co-designed solution, which enables on-the-fly limb generation and sub-operation fusion across $\mathsf{NTT}$, $\mathsf{Automorph}$, $\mathsf{KeyIP}$, $\mathsf{DiagIP}$, and $\mathsf{iNTT}$, drastically reducing SRAM requirement and DRAM access.
  • Figure 2: The overall system architecture on FPGA
  • Figure 3: Fully pipelined permutation circuit ($dp$-to-$dp$)
  • Figure 4: Scratchpad memory organization with BRAM or URAM banks
  • Figure 5: HE MM latency of CPU-based approaches and speedups achieved by FAME over the best CPU result (shown above the corresponding bar)