SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters

Yiping Wang, Hanxian Huang, Yifang Chen, Jishen Zhao, Simon Shaolei Du, Yuandong Tian

TL;DR

SHARP addresses the memory and latency challenges of deploying pretrained LLMs on resource-constrained devices by sharing parameters across adjacent layers and introducing low-rank recovery parameters to preserve performance. It relies on a two-stage recovery (Single Layer Warmup and Supervised Fine-Tuning) to align and regain model capacity, enabling a significant reduction in MLP parameter storage (38–65%) and substantial mobile latency savings (around 42%). The approach leverages the observed similarity of consecutive layer outputs and uses LoRA-style adapters to predict later layers from a single reference layer, with various replacement strategies and candidate transformations explored. Empirical results on Llama2-7b and smaller models show SHARP recovers perplexity across in-distribution tasks with limited fine-tuning data (often 50k examples) and maintains compatibility with quantization, making it practical for edge deployment and real-world mobile usage.

Abstract

While large language models (LLMs) have advanced natural language processing tasks, their growing computational and memory demands make deployment on resource-constrained devices like mobile phones increasingly challenging. In this paper, we propose SHARP (SHaring Adjacent Layers with Recovery Parameters), a novel approach to accelerate LLM inference by sharing parameters across adjacent layers, thus reducing memory load overhead, while introducing low-rank recovery parameters to maintain performance. Inspired by observations that consecutive layers have similar outputs, SHARP employs a two-stage recovery process: Single Layer Warmup (SLW) and Supervised Fine-Tuning (SFT). The SLW stage aligns the outputs of the shared layers using an $L_2$ loss, providing a good initialization for the subsequent SFT stage, which further restores model performance. Extensive experiments demonstrate that SHARP can recover the model's perplexity on various in-distribution tasks using no more than 50k fine-tuning examples while reducing the number of stored MLP parameters by 38% to 65%. We also conduct several ablation studies of SHARP and show that replacing layers towards the later parts of the model yields better performance retention, and that different recovery parameterizations perform similarly when parameter counts are matched. Furthermore, SHARP saves 42.8% in model storage and reduces the total inference time by 42.2% compared to the original Llama2-7b model on mobile devices. Our results highlight SHARP as an efficient solution for reducing inference costs in deploying LLMs without the need for pretraining-scale resources.
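
The sharing-plus-recovery idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two-matrix ReLU MLP, the dictionary layout, the helper names (`mlp`, `init_recovery`, `shared_mlp`, `slw_loss`), the toy dimensions, and the perturbed "target" layer are all assumptions made for illustration. The sketch assumes the recovery parameters are LoRA-style products `A @ B` added to the reference layer's weights, and that the SLW objective is a squared $L_2$ distance between the shared layer's output and the original layer's output.

```python
import numpy as np


def mlp(x, w_up, w_down):
    """Toy two-matrix ReLU MLP (a stand-in; Llama-style MLPs are gated)."""
    return np.maximum(x @ w_up, 0.0) @ w_down


def init_recovery(d_model, d_hidden, rank, rng):
    """LoRA-style init: A is random, B is zero, so each update A @ B starts at zero."""
    return {
        "up_a": rng.normal(0.0, 0.01, (d_model, rank)),
        "up_b": np.zeros((rank, d_hidden)),
        "down_a": rng.normal(0.0, 0.01, (d_hidden, rank)),
        "down_b": np.zeros((rank, d_model)),
    }


def shared_mlp(x, ref, rec):
    """Reuse the reference layer's weights, corrected by low-rank updates."""
    w_up = ref["up"] + rec["up_a"] @ rec["up_b"]
    w_down = ref["down"] + rec["down_a"] @ rec["down_b"]
    return mlp(x, w_up, w_down)


def slw_loss(x, ref, rec, target):
    """Mean squared L2 distance between shared-layer and original-layer outputs."""
    diff = shared_mlp(x, ref, rec) - mlp(x, target["up"], target["down"])
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


rng = np.random.default_rng(0)
d_model, d_hidden, rank = 8, 16, 2
ref = {"up": rng.normal(size=(d_model, d_hidden)),
       "down": rng.normal(size=(d_hidden, d_model))}
# Hypothetical target layer: the reference weights plus a small perturbation.
target = {k: w + 0.05 * rng.normal(size=w.shape) for k, w in ref.items()}
rec = init_recovery(d_model, d_hidden, rank, rng)
x = rng.normal(size=(4, d_model))

loss = slw_loss(x, ref, rec, target)
n_full = sum(w.size for w in target.values())
n_recovery = sum(w.size for w in rec.values())
print(f"initial SLW loss: {loss:.4f}")
print(f"stored parameters for the target layer: {n_recovery} instead of {n_full}")
```

Because the `B` factors start at zero, the shared layer initially reproduces the reference layer exactly; a warmup stage would then minimize `slw_loss` over the recovery parameters only, leaving the reference weights untouched. The storage saving comes from keeping just the low-rank factors for the target layer rather than its full weight matrices.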

Paper Structure

This paper contains 38 sections, 6 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: (a) Regular pretrained baseline model without layer sharing. (b) Adjacent layer sharing as used in MobileLLM (Liu et al., 2024), which repeats each layer twice and trains the model from scratch. (c) Direct Sharing: vanilla adjacent layer sharing applied directly to the pretrained model to accelerate inference. (d) (Ours) SHARP: SHaring Adjacent Layers with Recovery Parameters. SHARP leverages fine-tuning-scale data to train additional parameters $\Delta \Theta$, which are far fewer than the original parameters $\Theta$, in order to recover the model's performance. In this paper, we explore several candidate transformations, including the LoRA-style function, to apply to the additional parameters.
  • Figure 2: Language models are robust to the replacement of adjacent MLP layers. (Left) For each reference layer, we directly replace the MLP layer in the subsequent layer with that of the reference layer, then evaluate perplexity on various tasks. We find that, aside from the first and last layers, most replacements do not significantly increase perplexity compared to the original model (dotted line). If we fine-tune the model with additional low-rank learnable parameters (rank = 400) added to the next layer, the perplexity gap is effectively closed (as shown by the "Arxiv-math (Finetuned)" line). (Right) Similarly, we observe consistent perplexity results on Arxiv-math (baseline perplexity = 3.0) when using more general reference-target replacement pairs (i.e., using the reference layer to replace any later layer).
  • Figure 3: Average relative error between adjacent layers (mean of $\{\|\Theta_{i+1} - \Theta_i\| / \|\Theta_i\|\}_{i=1}^{31}$).
  • Figure 4: Impact of different layers on model capabilities. The x-axis denotes the index of the zeroed-out MLP layer (the layer whose weights are set to zero) in the modified model, and the y-axis shows the difference between the original model and the modified model on the given evaluation task, so lower values are better. (Left) Evaluation tasks focused on memorizing domain-specific knowledge or common sense. (Right) Evaluation tasks requiring reasoning abilities in areas like mathematics, physics, or general reasoning. We skip index 0 since it is critical, based on Figure 2.
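
The quantity in the Figure 3 caption can be computed as in the sketch below. This is an illustration under assumptions, not the paper's code: it reads $\Theta_i$ as the weight matrix of layer $i$ and takes $\|\cdot\|$ to be the Frobenius norm (the caption does not name the norm), and the function name `mean_adjacent_relative_error` is hypothetical.

```python
import numpy as np


def mean_adjacent_relative_error(layers):
    """Mean of ||Theta_{i+1} - Theta_i|| / ||Theta_i|| over consecutive layer pairs."""
    errors = [
        np.linalg.norm(nxt - cur) / np.linalg.norm(cur)  # Frobenius norm for 2-D arrays
        for cur, nxt in zip(layers[:-1], layers[1:])
    ]
    return float(np.mean(errors))


# Toy example with three 2x2 "layers"; a 32-layer model would yield 31 pairs.
toy_layers = [np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.full((2, 2), 3.0)]
result = mean_adjacent_relative_error(toy_layers)
print(result)  # 0.75: the two pairwise relative errors are 1.0 and 0.5
```

A small value indicates that consecutive layers are close to each other relative to their own magnitude, which is the kind of similarity that adjacent-layer sharing relies on.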