MPC-Minimized Secure LLM Inference
Deevashwer Rathee, Dacheng Li, Ion Stoica, Hao Zhang, Raluca Popa
TL;DR
The paper addresses privacy-preserving LLM inference by minimizing how much of the computation must run inside MPC (MPC-minimization). It introduces Marill, a fine-tuning framework that splits model weights into public and private components and applies high-level architectural changes (Layer Freezing, LoRA adaptation, Head Merging) to relocate expensive computations outside MPC, aided by knowledge distillation to preserve ML performance. Empirical results show 3.6–11.3x better runtime and 2.4–6.9x lower communication across MPC settings, while typically preserving over 90% of the performance of standard fine-tuning on tasks spanning code, chat, and translation. The approach is complementary to MPC-friendly approximations and benefits from open-source pre-trained weights, enabling practical privacy-preserving LLM services with broad applicability across secure inference protocols.
Abstract
Many inference services based on large language models (LLMs) pose a privacy concern, either revealing user prompts to the service or the proprietary weights to the user. Secure inference offers a solution to this problem through secure multi-party computation (MPC). However, it is still impractical for modern LLM workloads due to the large overhead imposed by MPC. To address this overhead, we propose Marill, a framework that adapts LLM fine-tuning to minimize MPC usage during secure inference. Marill introduces high-level architectural changes during fine-tuning that significantly reduce the number of expensive operations needed within MPC during inference, by removing some and relocating others outside MPC without compromising security. As a result, Marill-generated models are more efficient across all secure inference protocols, and our approach complements MPC-friendly approximations for such operations. Compared to standard fine-tuning, Marill results in 3.6-11.3x better runtime and 2.4-6.9x better communication during secure inference across various MPC settings, while typically preserving over 90% performance across downstream tasks.
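To give some intuition for why keeping part of the weights public can move work outside MPC, the following is a minimal sketch and not Marill's actual protocol. It assumes two-party additive secret sharing over the ring of integers modulo 2^32, which is a common MPC building block but is not specified in the text above. Under that assumption, multiplying a secret-shared input by a public matrix needs no interaction: each party applies the matrix to its own share. The names `share`, `reconstruct`, `public_matvec`, and `W_public` are hypothetical and chosen only for illustration.

```python
import secrets

MOD = 2**32  # ring size for additive sharing (illustrative assumption)


def share(x):
    """Split each entry of x into two additive shares modulo MOD."""
    s0 = [secrets.randbelow(MOD) for _ in x]
    s1 = [(xi - r) % MOD for xi, r in zip(x, s0)]
    return s0, s1


def reconstruct(s0, s1):
    """Recombine two additive shares into the underlying vector."""
    return [(a + b) % MOD for a, b in zip(s0, s1)]


def public_matvec(W, v):
    """Multiply a public matrix W by a vector (or one share of it).

    Because the map is linear, each party can run this on its own
    share without communicating with the other party.
    """
    return [sum(w * vi for w, vi in zip(row, v)) % MOD for row in W]


# Hypothetical public weights and a private input vector.
W_public = [[1, 2], [3, 4]]
x = [5, 6]

s0, s1 = share(x)                # the input is secret-shared
y0 = public_matvec(W_public, s0)  # party 0, local computation
y1 = public_matvec(W_public, s1)  # party 1, local computation
y = reconstruct(y0, y1)           # equals W_public applied to x
```

Multiplying by private weights, in contrast, generally requires an interactive protocol between the parties, which is the kind of cost an MPC-minimizing design tries to avoid.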
