FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Zirui Liu, Qingquan Song, Qiang Charles Xiao, Sathiya Keerthi Selvaraj, Rahul Mazumder, Aman Gupta, Xia Hu

TL;DR

FFSplit tackles the FFN bottleneck in transformer language models by identifying heavy-hitter FFN neurons and explicitly splitting the FFN into a heavy-hitter part and a remainder part. The heavy-hitter part is preserved, while the remainder undergoes targeted compression (e.g., low-rank, quantization), improving the accuracy-efficiency balance under a parameter budget. Empirical results on BERT using GLUE show substantial parameter reductions (around 43%) and consistent speedups (1.25–1.56x) with minimal accuracy loss, and LLM experiments with OPT demonstrate improved performance when combining FFSplit with quantization. The approach provides a hardware-friendly, fine-grained method to accelerate inference on commodity hardware without a major loss in model quality.

Abstract

The large number of parameters in Pretrained Language Models enhances their performance but also makes them resource-intensive and challenging to deploy on commodity hardware such as a single GPU. Due to the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency. Therefore, optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant part of the efficiency challenge is the feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of the total parameters and inference latency. In this paper, we first observe that only a few neurons of the FFN module have a large output norm for any input token, a.k.a. heavy hitters, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resources to the FFN parts with heavy hitters. In practice, our method can reduce model size by 43.1\% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
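
The $\frac{2}{3}$ parameter share can be checked with a back-of-the-envelope count. The sketch below assumes a standard transformer layer with hidden size $d$ and an FFN intermediate size of $4d$ (the usual BERT/OPT setting, not stated in the abstract), and it ignores biases, LayerNorm, and embedding parameters. It covers parameters only, not latency.

$$
\underbrace{4d^2}_{\text{attention: } W_Q, W_K, W_V, W_O}
\qquad
\underbrace{2 \cdot d \cdot 4d = 8d^2}_{\text{FFN: two } d \times 4d \text{ projections}}
\qquad
\frac{8d^2}{4d^2 + 8d^2} = \frac{2}{3}
$$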

Paper Structure (17 sections, 4 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Heavy Hitter neurons also exist in GeLU-based language models.
  • Figure 2: Comparison between the baseline model, the model without the top 3% heavy hitters, and the model without 3% of the light hitters.
  • Figure 3: Diagram of our proposed method. We explicitly split the original FFN into two parts according to the set of heavy hitters $\texttt{h}_2$: ${\bm{U}}_1={\bm{U}}_{:, \texttt{h}_2}$ and ${\bm{V}}_1={\bm{V}}_{\texttt{h}_2, :}$. Similarly, ${\bm{U}}_2$ and ${\bm{V}}_2$ are the FFN weights specified by the remaining neurons. We allocate fewer resources to the FFN part without heavy hitters, which is denoted with dotted lines.
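
The split described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names (`gelu`, `ffn`, `split_ffn`), the random weights, the omission of bias terms, and the heavy-hitter criterion (top 3% of neurons by activation norm over a batch of tokens) are all assumptions made for the example. It shows only that, because the activation is elementwise, partitioning the neurons into two sub-FFNs reproduces the original output before any compression is applied to the second part.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GeLU, applied elementwise
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn(x, U, V):
    # Bias-free FFN (biases omitted for brevity; an assumption of this sketch)
    return gelu(x @ U) @ V

def split_ffn(U, V, heavy_idx):
    # Partition the intermediate neurons: columns of U and matching rows of V
    mask = np.zeros(U.shape[1], dtype=bool)
    mask[heavy_idx] = True
    return (U[:, mask], V[mask, :]), (U[:, ~mask], V[~mask, :])

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 16, 64, 32
U = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_ff, d_model))
X = rng.standard_normal((n_tokens, d_model))

# Hypothetical heavy-hitter criterion: rank neurons by the norm of their
# activations across the token batch and keep the top 3%.
scores = np.linalg.norm(gelu(X @ U), axis=0)
k = max(1, int(round(0.03 * d_ff)))
heavy_idx = np.argsort(scores)[-k:]

(U1, V1), (U2, V2) = split_ffn(U, V, heavy_idx)

full = ffn(X, U, V)
split = ffn(X, U1, V1) + ffn(X, U2, V2)
```

In this sketch `full` and `split` agree up to floating-point error, so the split itself costs no accuracy. Any accuracy-efficiency trade-off would come from how aggressively the second pair (`U2`, `V2`) is then compressed.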