LRD-MPC: Efficient MPC Inference through Low-rank Decomposition

Tingting Tang; Yongqin Wang; Murali Annavaram

TL;DR

This work tackles the high cost of MPC-based ML inference, where linear layer matrix multiplications dominate latency and energy due to added communication and truncation overheads. It introduces LRD-MPC, which uses low-rank decomposition to split large weight multiplications into two smaller ones, and pairs this with truncation skipping and efficient linear layer concatenation to hide additional communication rounds. Across semi-honest $n$-PC and Trio 3-PC protocols, the approach achieves up to 33% online speedup, 52% energy savings, and 88% offline-phase reductions, with minimal accuracy loss (often below 0.5%). The methods are shown to be broadly applicable, framework-agnostic, and capable of substantial practical impact for secure ML inference in cloud environments.

Abstract

Secure Multi-party Computation (MPC) enables untrusted parties to jointly compute a function without revealing their inputs. Its application to machine learning (ML) has gained significant attention, particularly for secure inference services deployed across multiple cloud virtual machines (VMs), where each VM acts as an MPC party. Model providers secret-share model weights, and users secret-share inputs, ensuring that each server operates only on random shares. While MPC provides strong cryptographic guarantees, it incurs substantial computational and communication overhead. Deep neural networks rely heavily on convolutional and fully connected layers, which require costly matrix multiplications in MPC. To reduce this cost, we propose leveraging low-rank decomposition (LRD) for linear layers, replacing one large matrix multiplication with two smaller ones. However, applying LRD in MPC introduces two overheads. First, each matrix multiplication in MPC incurs a round of communication, so decomposing one matrix multiplication into two adds a communication round. Second, the added matrix multiplication requires an additional truncation step to maintain numerical precision. Since truncation itself requires communication and computation, these overheads can offset the gains from decomposition. To address this, we introduce two complementary optimizations: truncation skipping and efficient linear layer concatenation. Truncation skipping removes the extra truncation induced by LRD, while linear layer concatenation pipelines operations to hide the additional communication round. Together, these techniques mitigate the main overheads of LRD in MPC and improve overall efficiency. Our approach is broadly applicable across MPC protocols. Experiments show up to 25% speedup in n-PC and 33% in 3-PC protocols over full-rank baselines, along with up to 52% GPU energy savings and 88% reduction in offline-phase latency.
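
The core idea of replacing one large matrix multiplication with two smaller ones can be illustrated in plaintext, outside any MPC protocol. The following is a minimal sketch, not the authors' implementation: the layer sizes, the rank, and the names `A`, `B`, `W`, `matmul`, and `mult_count` are hypothetical, and the weight matrix is constructed to be exactly low-rank so that the two results match exactly. In practice a trained weight matrix is only approximated by its low-rank factors, which is where the small accuracy loss comes from.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def mult_count(batch, inner, out):
    """Scalar multiplications in a (batch x inner) @ (inner x out) product."""
    return batch * inner * out


def rand_matrix(rows, cols):
    """Small random integer matrix; integers keep the comparison exact."""
    return [[random.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


random.seed(0)

# Hypothetical layer sizes chosen only for illustration.
batch, d_in, d_out, rank = 4, 64, 64, 8

# Assumption: W is exactly the product of two thin factors A and B.
A = rand_matrix(d_in, rank)    # d_in x rank
B = rand_matrix(rank, d_out)   # rank x d_out
W = matmul(A, B)               # d_in x d_out, rank at most 8

x = rand_matrix(batch, d_in)

# Full-rank layer: one large multiplication.
full = matmul(x, W)

# Decomposed layer: two smaller multiplications applied in sequence.
# In MPC, this second step is what adds a communication round and a truncation.
factored = matmul(matmul(x, A), B)

full_cost = mult_count(batch, d_in, d_out)
lrd_cost = mult_count(batch, d_in, rank) + mult_count(batch, rank, d_out)

print(full == factored)      # True
print(full_cost, lrd_cost)   # 16384 4096
```

With these sizes the decomposed layer needs a quarter of the scalar multiplications, but it performs two dependent steps instead of one. That dependency is the source of the extra communication round and truncation that truncation skipping and linear layer concatenation are designed to remove or hide.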
