FAME: FPGA Acceleration of Secure Matrix Multiplication with Homomorphic Encryption
Zhihan Xu, Rajgopal Kannan, Viktor K. Prasanna
TL;DR
FAME tackles the prohibitive runtime of secure matrix multiplication (MM) under Homomorphic Encryption (HE) by introducing a memory-optimized datapath for Homomorphic Linear Transformation (HLT) and an FPGA accelerator built on it. The authors first develop a cost model to estimate on-chip memory requirements for a given HE parameter set and matrix size, then design MO-HLT, which fuses sub-operations and generates limbs on the fly to reduce off-chip traffic and on-chip memory demand. Building on this datapath, FAME implements a configurable FPGA-based architecture with modular ALU arrays, a permutation circuit, and a multi-banked scratchpad, achieving an average speedup of 221x over state-of-the-art CPU-based implementations on an Alveo U280. The work demonstrates practical, scalable fully encrypted MM workflows and points to future extensions such as block MM and shape-specific optimizations.
Abstract
Homomorphic Encryption (HE) enables secure computation on encrypted data, addressing privacy concerns in cloud computing. However, the high computational cost of HE operations, particularly matrix multiplication (MM), remains a major barrier to its practical deployment. Accelerating homomorphic encrypted MM (HE MM) is therefore crucial for applications such as privacy-preserving machine learning. In this paper, we present a bandwidth-efficient FPGA implementation of HE MM. We first develop a cost model to evaluate the on-chip memory requirements for a given set of HE parameters and input matrix sizes. Our analysis shows that optimizing on-chip memory usage is critical for scalable and efficient HE MM. To this end, we design a novel datapath for Homomorphic Linear Transformation (HLT), the primary bottleneck in HE MM. The proposed datapath significantly reduces off-chip memory traffic and on-chip memory demand by enabling fine-grained data reuse. Leveraging this datapath, we introduce FAME, the first FPGA-based accelerator specifically tailored for HE MM. FAME supports arbitrary matrix shapes and is configurable across a wide range of HE parameter sets. We implement FAME on an Alveo U280 FPGA and evaluate its performance across diverse matrix sizes and shapes. Experimental results show that FAME achieves an average speedup of 221x over state-of-the-art CPU-based implementations, demonstrating its scalability and practicality for large-scale consecutive HE MM and real-world workloads.
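To make the role of HLT concrete, the following is a minimal plaintext sketch of the widely used diagonal method for applying a linear transformation with only slot-wise multiplications, rotations, and additions, which are the operations an HE scheme exposes on packed ciphertexts. It is an illustration under stated assumptions, not FAME's MO-HLT datapath: the abstract does not specify the algorithm, and the function names (`rotate_left`, `generalized_diagonal`, `linear_transform`) are hypothetical. In an encrypted setting each rotation would be a costly homomorphic operation, which is one reason HLT tends to dominate runtime.

```python
# Plaintext analogue of a diagonal-method linear transformation.
# Assumption: square n x n matrix, vector of length n, plain Python lists.
# Under HE, `vec` would be a ciphertext and each diagonal an encoded plaintext.

def rotate_left(vec, k):
    """Cyclically rotate vec left by k slots: result[i] = vec[(i + k) % n]."""
    n = len(vec)
    k %= n
    return vec[k:] + vec[:k]

def generalized_diagonal(matrix, k):
    """Return the k-th generalized diagonal: d[i] = matrix[i][(i + k) % n]."""
    n = len(matrix)
    return [matrix[i][(i + k) % n] for i in range(n)]

def linear_transform(matrix, vec):
    """Compute matrix @ vec as a sum over k of diag_k * rotate_left(vec, k)."""
    n = len(vec)
    acc = [0] * n
    for k in range(n):
        diag = generalized_diagonal(matrix, k)
        if not any(diag):
            continue  # an all-zero diagonal contributes nothing; skip its rotation
        rotated = rotate_left(vec, k)
        acc = [a + d * r for a, d, r in zip(acc, diag, rotated)]
    return acc

def reference_matvec(matrix, vec):
    """Ordinary row-by-row matrix-vector product, used only as a check."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

if __name__ == "__main__":
    U = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    v = [1, 0, 2]
    print(linear_transform(U, v))   # diagonal method
    print(reference_matvec(U, v))   # direct product, should match
```

The number of nonzero diagonals determines how many rotations are needed, so sparse transformation matrices are cheaper to apply; how a given accelerator schedules and reuses the rotated data is a separate design question.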
