ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang

TL;DR

ME-Switch proposes a salient-aware delta compression method that identifies salient input channels by reconstruction error and applies mixed-precision quantization: non-salient channels are reduced to low bitwidths while salient ones are kept intact, cutting storage demand without compromising performance.

Abstract

LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts can pose significant memory challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests can incur substantial I/O costs. Previous approaches decompose the expert weights as the pre-trained weights plus delta weights, followed by quantizing the delta weights using output channel-wise step sizes to reduce the model size. However, these methods overlook the fact that certain input channels of delta weights can cause significant quantization errors at extremely low bitwidths. Additionally, existing methods assume that the appropriate model for a user request is known in advance, which is not the case in practice. To this end, we introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs. To condense the number of bits required for describing the delta weights, we propose a salient-aware delta compression method that identifies salient input channels based on reconstruction error and applies mixed-precision quantization, reducing non-salient channels to low bits while keeping salient ones intact, cutting storage demand without compromising performance. Moreover, we develop a model-level routing method that efficiently directs user queries to the most suitable expert by performing domain classification. Extensive experiments show the promising memory efficiency and routing performance of ME-Switch. For example, when serving three models from the Mistral-7B family, ME-Switch reduces the model size by $1.74\times$ and maintains nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Notably, our method can efficiently serve 16 Mistral-7B models on a single NVIDIA A100 GPU.
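
The baseline that the abstract attributes to previous approaches (expert weights = pre-trained weights + delta weights, with the delta quantized using output channel-wise step sizes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the symmetric uniform quantizer, and the row/column layout (rows as input channels, columns as output channels) are assumptions made for the example.

```python
from typing import List

# Assumed layout: w[i][j] holds input channel i (row), output channel j (column).
Matrix = List[List[float]]


def delta_weights(w_expert: Matrix, w_base: Matrix) -> Matrix:
    """Delta weights: fine-tuned expert weights minus pre-trained weights."""
    return [
        [e - b for e, b in zip(row_e, row_b)]
        for row_e, row_b in zip(w_expert, w_base)
    ]


def quantize_per_output_channel(delta: Matrix, bits: int) -> Matrix:
    """Quantize then dequantize delta with one step size per output channel.

    Assumption: a symmetric uniform quantizer; the source does not specify
    the exact quantizer used.
    """
    if bits < 2:
        raise ValueError("this symmetric scheme needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    steps = []
    for j in range(len(delta[0])):
        peak = max(abs(row[j]) for row in delta)
        steps.append(peak / qmax if peak > 0 else 1.0)
    return [
        [max(-qmax - 1, min(qmax, round(v / s))) * s for v, s in zip(row, steps)]
        for row in delta
    ]


def reconstruct(w_base: Matrix, delta_hat: Matrix) -> Matrix:
    """Approximate expert weights: shared base plus the compressed delta."""
    return [
        [b + d for b, d in zip(row_b, row_d)]
        for row_b, row_d in zip(w_base, delta_hat)
    ]
```

Under this scheme only the base weights and one low-bit delta per expert need to be stored, which is where the memory saving comes from. Small entries in a column with a large peak round to zero at low bitwidths, which is the kind of quantization error the abstract refers to.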

Paper Structure (17 sections, 4 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: An illustration of the input channel-wise maximum and minimum values for the delta weights of Speechless-Code-Mistral-7B. The variability across input channels highlights that certain salient channels, irrespective of their magnitude, can cause significant quantization errors when quantized at ultra-low bitwidths, underscoring their critical role in preserving performance.
  • Figure 2: An illustrative comparison between magnitude-based selection of salient delta weights and our reconstruction-error-based selection method. Given a delta weight matrix $\Delta \in \mathbb{R}^{m \times n}$, its quantized version $\hat{\Delta}$, and input ${\bf x} \in \mathbb{R}^m$, where $m$ and $n$ denote the number of input and output channels, respectively, our method measures the importance of the $i$-th input channel of the delta weights by $\sum_{j=1}^{n} \| {\bf x}_i \Delta_{ij} - {\bf x}_i \hat{\Delta}_{ij} \|_2^2$.
  • Figure 3: An illustration of the model-level routing. We first prompt the model-level router with the user query using a template (see the prompt template section for more details) that presents a list of potential domains. The router then assesses these options and selects the most relevant domain by answering a multiple-choice question, effectively classifying the query into the corresponding category.
  • Figure 4: Average accuracy vs. delta weights size across different domains. "Baseline" refers to the fixed-precision quantization baseline. The dashed line indicates the full-precision counterpart.
  • Figure 5: Effect of supervised fine-tuning (SFT) in model-level routing. We assess the performance of routing by measuring the accuracy on a 4-domain classification task (instruction, mathematics, code, and Chinese).
  • ...and 5 more figures
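
The importance score in the Figure 2 caption can be made concrete with a small sketch. It scores each input channel $i$ by $\sum_{j=1}^{n} ({\bf x}_i \Delta_{ij} - {\bf x}_i \hat{\Delta}_{ij})^2$ for a single input vector, keeps the top-$k$ channels in full precision, and uses the quantized values elsewhere. The helper names, the parameter `k`, the symmetric quantizer, and the choice to compute step sizes over the full matrix are all assumptions for illustration; the source does not specify these details.

```python
from typing import List, Tuple

# Assumed layout: delta[i][j] holds input channel i (row), output channel j (column).
Matrix = List[List[float]]


def fake_quantize(delta: Matrix, bits: int) -> Matrix:
    """Symmetric uniform quantize-dequantize, one step size per output channel."""
    if bits < 2:
        raise ValueError("this symmetric scheme needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    steps = []
    for j in range(len(delta[0])):
        peak = max(abs(row[j]) for row in delta)
        steps.append(peak / qmax if peak > 0 else 1.0)
    return [
        [max(-qmax - 1, min(qmax, round(v / s))) * s for v, s in zip(row, steps)]
        for row in delta
    ]


def channel_importance(delta: Matrix, delta_hat: Matrix, x: List[float]) -> List[float]:
    """Score of input channel i: sum_j (x_i * delta_ij - x_i * delta_hat_ij)^2."""
    return [
        sum((xi * d - xi * dh) ** 2 for d, dh in zip(row, row_hat))
        for xi, row, row_hat in zip(x, delta, delta_hat)
    ]


def mixed_precision_delta(
    delta: Matrix, x: List[float], bits: int, k: int
) -> Tuple[Matrix, List[int]]:
    """Keep the k highest-scoring input channels intact; quantize the rest."""
    delta_hat = fake_quantize(delta, bits)
    scores = channel_importance(delta, delta_hat, x)
    ranked = sorted(range(len(delta)), key=lambda i: scores[i], reverse=True)
    salient = set(ranked[:k])
    mixed = [
        list(delta[i]) if i in salient else delta_hat[i]
        for i in range(len(delta))
    ]
    return mixed, sorted(salient)


# Toy example: row 0 has the largest magnitude, but row 2 sees the largest input.
delta = [[0.9, -0.8], [0.1, 0.1], [0.4, 0.3]]
x = [0.1, 1.0, 5.0]
mixed, salient = mixed_precision_delta(delta, x, bits=2, k=1)
```

In this toy case a magnitude-based criterion would pick row 0, whereas the reconstruction-error score picks row 2, because row 0 happens to quantize without error and row 2 is multiplied by a large activation. With several calibration inputs, one natural extension (again an assumption) is to sum the scores over inputs.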