FoldGPT: Simple and Effective Large Language Model Compression Scheme

Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen

TL;DR

FoldGPT tackles the challenge of deploying large language models on mobile devices by revealing substantial depth-wise redundancy in LLMs and proposing a two-step compression: gated block removal to prune redundant layers and grouped parameter sharing to reuse weights across blocks. A learnable gating mechanism accounts for inter-block coupling, while a polarization-friendly optimization and FLOPs-based constraints determine which blocks to remove; remaining blocks undergo grouped sharing with layernorm re-adaptation and targeted distillation to recover performance. Empirical results on LLaMA-2-7B, Gemma-2B, and TinyLLaMA-1.1B show FoldGPT achieving meaningful parameter reductions with minimal accuracy loss, outperforming state-of-the-art depth-based pruning methods and enabling practical edge deployment. The approach combines block-level gating, cross-block weight sharing, and tail-layer distillation to deliver lightweight LLMs suitable for mobile inference, with limitations tied to architectures that do not exhibit repetitive block structures.

Abstract

The demand for deploying large language models (LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-scale models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and find that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing. This strategy consists of three parts: (1) Based on learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks, and then delete redundant blocks according to the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these blocks, we "cure" the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression, showing that model lightweighting is feasible through straightforward block removal and parameter sharing.
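The first two parts of the strategy can be illustrated with a toy sketch. It is not the authors' implementation: the `Block` class, the gate values, the `removal_rate` of 0.25, and the `group_size` of 2 are all assumptions made for illustration. It also ranks blocks by a fixed gate value instead of learning the gates, and it omits the per-block LayerNorm re-adaptation, fine-tuning, and distillation that the paper describes.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for one transformer block."""
    name: str
    weights: list = field(default_factory=lambda: [0.0])


def remove_blocks(blocks, gates, removal_rate):
    """Drop the lowest-gate blocks; survivors keep their original order."""
    n_remove = int(len(blocks) * removal_rate)
    # Indices sorted by ascending gate value: least important first.
    by_importance = sorted(range(len(blocks)), key=lambda i: gates[i])
    removed = set(by_importance[:n_remove])
    return [b for i, b in enumerate(blocks) if i not in removed]


def share_in_groups(blocks, group_size):
    """Point every child block at the weights of its group's parent block."""
    for start in range(0, len(blocks), group_size):
        parent = blocks[start]
        for child in blocks[start + 1:start + group_size]:
            child.weights = parent.weights  # same object, not a copy
    return blocks


def count_unique_weights(blocks):
    """Number of distinct weight objects still stored after sharing."""
    return len({id(b.weights) for b in blocks})


blocks = [Block(f"block{i}", [float(i)]) for i in range(8)]
# Assumed gate values; in the paper these are learned, not hand-set.
gates = [0.9, 0.8, 0.1, 0.7, 0.2, 0.6, 0.95, 0.85]

kept = remove_blocks(blocks, gates, removal_rate=0.25)
folded = share_in_groups(kept, group_size=2)
```

In this toy setting, removal drops the two lowest-gate blocks (`block2` and `block4`), and sharing folds the six survivors into three groups, so only three sets of weights remain in memory while the forward pass still runs six blocks.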

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An overview of FoldGPT. (a) Two-step volume compression strategy, including gated block removal and grouped parameter sharing. (b) Block structure with learnable gating parameters. (c) Grouped parameter sharing structure. The first block in a group is called the parent block; the remaining blocks, called child blocks, share the parent block's weight parameters.
  • Figure 2: Block redundancy analysis of models with sizes from 1B to 7B. The red line shows the cosine similarity between the input and output of the current block, while the blue line shows the cosine similarity between the output of the current block and the input at the starting point.
  • Figure 3: Polarization characteristics for different values of $\epsilon$.
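
The two similarity curves described for Figure 2 can be computed along the following lines. This is a minimal sketch under stated assumptions: `hidden_states` is a hypothetical list of per-block activations for a single token, and "the input at the starting point" is read here as the input to the first block, which the caption does not state explicitly.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_similarities(hidden_states):
    """Return (input-vs-output, output-vs-first-input) similarity per block.

    hidden_states[0] is the input to the first block and
    hidden_states[i] is the output of block i (an assumed layout).
    """
    first_input = hidden_states[0]
    in_out = []    # red line: current block input vs. its output
    to_start = []  # blue line: current block output vs. first input
    for i in range(1, len(hidden_states)):
        in_out.append(cosine_similarity(hidden_states[i - 1], hidden_states[i]))
        to_start.append(cosine_similarity(hidden_states[i], first_input))
    return in_out, to_start


# Toy activations: block 1 leaves its input unchanged, block 2 rotates it.
hidden_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
in_out, to_start = block_similarities(hidden_states)
```

A block whose input-to-output similarity stays close to 1 changes its input very little, which is the kind of depth-wise redundancy the analysis points to.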