FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Zirui Liu; Qingquan Song; Qiang Charles Xiao; Sathiya Keerthi Selvaraj; Rahul Mazumder; Aman Gupta; Xia Hu

TL;DR

FFSplit tackles the FFN bottleneck in transformer language models by identifying heavy-hitter FFN neurons and explicitly splitting the FFN into a heavy-hitter part and a remainder part. The heavy-hitter part is preserved, while the remainder undergoes targeted compression (e.g., low-rank approximation or quantization), improving the accuracy-efficiency balance under a parameter budget. Empirical results on BERT using GLUE show substantial parameter reductions (around 43%) and consistent speedups (1.25–1.56×) with minimal accuracy loss, and LLM experiments with OPT demonstrate improved performance when combining FFSplit with quantization. The approach provides a hardware-friendly, fine-grained method to accelerate inference on commodity hardware without a major loss in model quality.

Abstract

The large number of parameters in Pretrained Language Models enhances their performance, but also makes them resource-intensive and challenging to deploy on commodity hardware such as a single GPU. Because of the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency, so optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant portion of the efficiency challenge comes from the feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of the total parameters and inference latency. In this paper, we first observe that only a few neurons of the FFN module, a.k.a. heavy hitters, have a large output norm for any input token, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resources to the FFN part with heavy hitters. In practice, our method can reduce model size by 43.1\% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
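The heavy-hitter observation in the abstract can be illustrated with a small sketch. It is not the authors' procedure: the selection rule (per-neuron L2 norm of the activation over a batch of tokens, keeping a fixed top fraction), the function name `find_heavy_hitters`, the `ratio` parameter, and the synthetic weights with three planted large-norm neurons are all assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def find_heavy_hitters(X, U, ratio=0.03):
    """Return sorted indices of the neurons with the largest activation norm.

    X: (tokens, d) inputs, U: (d, d_ff) first FFN weight matrix.
    The top-`ratio` rule is an assumed criterion, not taken from the paper.
    """
    H = gelu(X @ U)                     # (tokens, d_ff) neuron outputs
    norms = np.linalg.norm(H, axis=0)   # one norm per neuron, across tokens
    k = max(1, int(round(ratio * U.shape[1])))
    return np.sort(np.argsort(norms)[-k:])

# Synthetic example: plant three neurons whose weights are much larger.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
U = 0.1 * rng.normal(size=(16, 100))
planted = [5, 42, 77]
U[:, planted] *= 50.0

heavy = find_heavy_hitters(X, U, ratio=0.03)
print(heavy)
```

On real models the statistics would be collected from actual hidden states rather than random inputs, and the fraction of neurons treated as heavy hitters would be a tunable choice.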

Paper Structure (17 sections, 4 equations, 3 figures, 3 tables)

Introduction
Background and Motivation
Related Work
Efficiency Bottleneck of LM Inference
Approximation in LM Inference
Methodology
Heavy Hitter Exists and Matters for Performance
Framework
Experiments
BERT Experimental Analysis
Experimental Settings
Datasets and Evaluation Protocol.
Adopted Models and Compression Methods.
Hyperparameter Settings.
Accuracy-Efficiency Trade-Off
...and 2 more sections

Figures (3)

Figure 1: Heavy Hitter neurons also exist in GeLU-based language models.
Figure 2: Comparison between the baseline model, the model without the top 3% heavy hitters, and the model without 3% light hitters.
Figure 3: Diagram of our proposed method. We explicitly split the original FFN into two parts according to the set of heavy hitters $\texttt{h}_2$: ${\bm{U}}_1={\bm{U}}_{:, \texttt{h}_2}$ and ${\bm{V}}_1={\bm{V}}_{\texttt{h}_2, :}$. Similarly, ${\bm{U}}_2$ and ${\bm{V}}_2$ are the FFN weights specified by the remaining neurons. We allocate fewer resources to the FFN without heavy hitters, which is denoted with dotted lines.
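The split described in the Figure 3 caption can be sketched in a few lines of NumPy. The sketch assumes a bias-free FFN of the form $\sigma(XU)V$ with an elementwise GeLU; under that assumption, selecting columns of $U$ and the matching rows of $V$ gives two smaller FFNs whose outputs sum exactly to the original output. The helper names `split_ffn` and `low_rank`, the index set `heavy`, and the choice of a truncated SVD for the remainder are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def split_ffn(U, V, heavy):
    """Split FFN weights into (U1, V1) for heavy hitters and (U2, V2) for the rest."""
    mask = np.zeros(U.shape[1], dtype=bool)
    mask[np.asarray(heavy)] = True
    return (U[:, mask], V[mask, :]), (U[:, ~mask], V[~mask, :])

def low_rank(W, rank):
    """Best rank-`rank` approximation of W via truncated SVD.

    Returned as a dense matrix for clarity; a real deployment would store
    the two factors instead, which is where the parameter saving comes from.
    """
    P, s, Qt = np.linalg.svd(W, full_matrices=False)
    return (P[:, :rank] * s[:rank]) @ Qt[:rank, :]

rng = np.random.default_rng(0)
d, d_ff = 16, 64
X = rng.normal(size=(8, d))
U = rng.normal(size=(d, d_ff))
V = rng.normal(size=(d_ff, d))
heavy = [3, 10, 40]  # hypothetical heavy-hitter indices

(U1, V1), (U2, V2) = split_ffn(U, V, heavy)

# The split alone changes nothing: the activation is elementwise per neuron.
full = gelu(X @ U) @ V
split = gelu(X @ U1) @ V1 + gelu(X @ U2) @ V2
exact = np.allclose(full, split)

# Compress only the remainder; the heavy-hitter part stays at full precision.
U2_lr = low_rank(U2, rank=8)
approx = gelu(X @ U1) @ V1 + gelu(X @ U2_lr) @ V2
print(exact, np.linalg.norm(full - approx) / np.linalg.norm(full))
```

Because the split itself is lossless, any accuracy change comes only from how aggressively the remainder is compressed, which is the knob that the resource allocation in Figure 3 controls.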
