FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference

Chenqi Lin, Tianshi Xu, Zebin Yang, Runsheng Wang, Ru Huang, Meng Li

TL;DR

FastQuery tackles privacy-preserving embedding table queries for private LLM inference by targeting the dominant communication bottleneck in HE-based matrix-vector multiplication. It introduces a joint protocol and quantization strategy that exploits the one-hot nature of token queries and the embedding table's robustness to low-bit quantization, reducing both the plaintext/ciphertext bit-width and the number of output ciphertexts. Key contributions include a communication-aware embedding table quantization method, a one-hot-aware dense packing algorithm with per-channel mixed precision, and a data-free embedding table fine-tuning procedure. Together these yield large reductions in communication and latency with minimal performance loss. The approach makes private embedding queries practical for large vocabularies (e.g., vocabulary size $m$ on the order of tens of thousands of tokens) and across multiple model sizes.

Abstract

With the fast evolution of large language models (LLMs), privacy concerns with user queries arise as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query has to be formulated as an HE-based matrix-vector multiplication problem and suffers from enormous computation and communication overhead. We observe that the overhead mainly comes from neglecting 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low bit-width quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., Cheetah, Iron, and Bumblebee, FastQuery achieves more than $4.3\times$, $2.7\times$, and $1.3\times$ latency reduction, respectively, and more than $75.7\times$, $60.2\times$, and $20.2\times$ communication reduction, respectively, on both LLAMA-7B and LLAMA-30B.
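
To make the two observations in the abstract concrete, the following plaintext sketch (no encryption involved) shows that an embedding lookup is a one-hot vector multiplied by the table, and that the result is always exactly one table row. Because no accumulation across rows takes place, the output magnitude never exceeds that of a single quantized entry. The helper names, the toy table, and the symmetric 4-bit quantizer are illustrative assumptions and are not taken from the paper.

```python
def one_hot(index, vocab_size):
    """Return a length-`vocab_size` list with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(vocab_size)]


def query_as_matvec(table, index):
    """Look up row `index` of `table` via a one-hot vector-matrix product."""
    q = one_hot(index, len(table))
    dim = len(table[0])
    return [sum(q[i] * table[i][j] for i in range(len(table)))
            for j in range(dim)]


def quantize(table, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers.

    Hypothetical quantizer for illustration only; the paper's
    communication-aware scheme is not reproduced here.
    """
    max_abs = max(abs(v) for row in table for v in row)
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax if max_abs else 1.0
    q_table = [[round(v / scale) for v in row] for row in table]
    return q_table, scale


# Toy 4-token, 3-dimensional embedding table (made-up values).
table = [
    [0.5, -1.0, 0.25],
    [1.0, 0.0, -0.5],
    [-0.75, 0.5, 1.0],
    [0.0, 0.25, -0.25],
]

q_table, scale = quantize(table, bits=4)
row = query_as_matvec(q_table, index=2)

# The one-hot product selects a single row, so every output value
# fits in the same number of bits as a quantized table entry.
print(row, max(abs(v) for v in row))
```

With a dense query vector the same product would sum over all rows and need extra bits of headroom; the one-hot structure removes that growth, which is one way to see why a smaller plaintext modulus can suffice.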

Paper Structure (18 sections, 1 equation, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Latency comparison of QKV projection and embedding table query on GPU, the prior-art 2PC framework Iron, and our 2PC framework FastQuery, on LLAMA-7B and LLAMA-13B. Latencies are normalized so that the QKV projection is $1.0$.
  • Figure 2: Flow of secure embedding table query.
  • Figure 3: A matrix-vector multiplication example of coefficient packing.
  • Figure 4: Overview of FastQuery Framework.
  • Figure 5: The proposed private embedding table query protocol.
  • ...and 5 more figures
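
As a rough illustration of the coefficient packing mentioned in the caption of Figure 3, the sketch below computes a matrix-vector product with a single polynomial multiplication in $\mathbb{Z}[X]/(X^N+1)$, working on plaintext integers only. The layout used here (vector entry $j$ at coefficient $j$, matrix entry $(i, j)$ at coefficient $i \cdot \text{cols} + \text{cols} - 1 - j$) is the encoding commonly associated with Cheetah-style protocols. Whether Figure 3 uses exactly this layout is an assumption, and all function names are hypothetical.

```python
def negacyclic_mul(a, b, N):
    """Multiply two length-N coefficient lists in Z[X]/(X^N + 1)."""
    out = [0] * N
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] += ai * bj
            else:
                out[k - N] -= ai * bj  # wrap around using X^N = -1
    return out


def encode_vector(x, N):
    """Place x[j] at coefficient j."""
    poly = [0] * N
    for j, v in enumerate(x):
        poly[j] = v
    return poly


def encode_matrix(W, N):
    """Place W[i][j] at coefficient i*cols + cols - 1 - j."""
    rows, cols = len(W), len(W[0])
    if rows * cols > N:
        raise ValueError("matrix does not fit in one polynomial")
    poly = [0] * N
    for i in range(rows):
        for j in range(cols):
            poly[i * cols + cols - 1 - j] = W[i][j]
    return poly


def matvec_via_packing(W, x, N):
    """Compute W @ x with one polynomial product, then read the results."""
    rows, cols = len(W), len(x)
    prod = negacyclic_mul(encode_matrix(W, N), encode_vector(x, N), N)
    # Row i's inner product lands at coefficient i*cols + cols - 1.
    return [prod[i * cols + cols - 1] for i in range(rows)]


W = [[1, 2, 3],
     [4, 5, 6]]
print(matvec_via_packing(W, [1, 2, 3], N=8))  # dense vector
print(matvec_via_packing(W, [0, 1, 0], N=8))  # one-hot vector selects column 1
```

In this layout only `rows` of the `N` output coefficients carry results; the remaining coefficients hold cross terms that are discarded. This sparsity in the output is the kind of inefficiency that a dense packing scheme aims to reduce.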