Leveraging the true depth of LLMs

Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret

TL;DR

This work introduces Layer Parallelism (LP), a post-hoc method that parallelizes consecutive Transformer layers to reduce depth and inter-GPU communication during LLM inference without retraining. By restructuring the computational graph to run adjacent layers in parallel, LP achieves substantial throughput gains while incurring only modest losses in perplexity and downstream task performance, with some recovery possible through lightweight fine-tuning. The authors provide empirical evidence across multiple models (e.g., Llama2-7B, Llama3-3B, Qwen-4B/14B) showing speedups up to around 1.46x on multi-GPU setups, and they offer a theoretical analysis of the LP approximation error and demonstrate limitations and trade-offs. Overall, LP offers a practical, scalable approach to accelerating LLM deployment, highlighting the nuanced nature of true model depth and paving the way for hybrid efficiency strategies.

Abstract

The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense computational cost. While recent work has shown that many LLM layers can be reordered or even removed with minimal impact on accuracy, these insights have not been translated into significant inference speedups. To bridge this gap, we introduce a novel method that restructures the computational graph by grouping and evaluating consecutive layer pairs in parallel. This approach, requiring no retraining, yields a 1.19x throughput gain on Llama 2 7B while reducing the average benchmark accuracy by only 1.5%. We demonstrate the practical value of this method for large-scale LLM deployment and show that some of the lost accuracy can be recovered with lightweight fine-tuning of the parallelized layers.
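To make the idea of evaluating consecutive layer pairs in parallel concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the authors' implementation: each layer is modeled as a toy residual branch `tanh(x W)`, and the names `make_block`, `forward_sequential`, and `forward_pairwise_parallel` are hypothetical. In a pair, both branches read the same input, so neither depends on the other's output.

```python
import numpy as np

def make_block(rng, dim, scale=0.05):
    """Return a toy residual branch f(x) = tanh(x W) with small random weights."""
    w = rng.normal(scale=scale, size=(dim, dim))
    return lambda x: np.tanh(x @ w)

def forward_sequential(x, blocks):
    """Standard residual stack: each block sees the previous block's output."""
    for f in blocks:
        x = x + f(x)
    return x

def forward_pairwise_parallel(x, blocks, start, end):
    """Evaluate blocks[start:end] in consecutive pairs that share one input.

    Both branches of a pair read the same x, so they are independent and could
    run concurrently; all other blocks run sequentially as usual.
    """
    i = 0
    while i < len(blocks):
        if start <= i and i + 1 < end:
            x = x + blocks[i](x) + blocks[i + 1](x)  # one step for two layers
            i += 2
        else:
            x = x + blocks[i](x)
            i += 1
    return x

rng = np.random.default_rng(0)
dim, n_layers = 16, 8
blocks = [make_block(rng, dim) for _ in range(n_layers)]
x = rng.normal(size=(4, dim))

y_seq = forward_sequential(x, blocks)
y_par = forward_pairwise_parallel(x, blocks, start=2, end=6)
rel_err = np.linalg.norm(y_par - y_seq) / np.linalg.norm(y_seq)
print(f"relative difference: {rel_err:.4f}")
```

The paired output differs from the sequential one because the second branch of each pair no longer sees the first branch's contribution. The toy weights are kept small so that this difference stays small; how well the approximation holds in a real LLM is an empirical question that the paper addresses with perplexity and benchmark measurements.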

Paper Structure

This paper contains 39 sections, 14 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The effect of LP on execution time (4K tokens) and perplexity (measured against RedPajama [together2023redpajama]).
  • Figure 2: Comparison of a normal transformer block (a) with our layer parallel implementation (b). Divergent paths in (b) are split across the Tensor Parallel axis (Eq. \ref{eq:lp}).
  • Figure 3: Diagram of the transformations applied in § \ref{sec:effective-depth}. Diagrams (a,b,c,d) represent shuffling, merging, pruning and parallelization, respectively.
  • Figure 4: Changes in perplexity when applying transformations on contiguous stretches of layers. Each of the five heatmaps corresponds to a transformation of a group of consecutive layers, where the row index $s$ is the first layer of the group and the column index $e$ is the last. The color coding indicates how the perplexity, estimated on a subset of RedPajama [together2023redpajama], is affected by the corresponding modification of the model. The perplexity of the base Llama 2 7B model is $6.2$. In (a), we shuffle the layers from $s$ to $e$ on each forward pass. Many consecutive layers can be shuffled with little impact on the overall perplexity. For instance, shuffling layers $15$ to $25$ ($10$ layers in total) raises the perplexity only to $9.1$. In (b), we prune contiguous stretches of layers. Few blocks can be removed before the perplexity degrades significantly. In (c), we merge contiguous layers. The results are nearly identical to those for pruning, which shows that merging layers offers no advantage, most likely because it averages matrices that originate from different initial values. In (d), we run contiguous blocks in parallel. Given the success of shuffling, it makes sense that this approach works well: running blocks $17$ to $27$ in parallel raises the perplexity to $9.3$. Finally, in (e), we run pairs of consecutive layers in parallel, which lets us parallelize much longer stretches of layers. For instance, applying this transformation from layer $4$ to $29$ increases the perplexity only to $9.1$ and reduces the depth of the model from $32$ to $19$. This result makes it possible to leverage this parallelism for faster inference, as we discuss in § \ref{sec:efficiency}.
  • Figure 5: CKA similarity for Qwen3-4B between the original MHA/FFN activations and the counterfactual activations that exclude incoming residuals. Higher values imply greater invariance to the upstream residual stream. A plateau of high CKA similarity between pairs of layers is preceded by a sharp similarity decline at layer 16, which coincides with the performance degradations experienced when applying different levels of LP at different positions.
  • ...and 6 more figures
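
The depth reduction quoted in the Figure 4 caption (from $32$ to $19$ when pairing layers $4$ to $29$) can be checked with a short sketch. The function `pair_schedule` is a hypothetical helper, and it assumes 0-based layer indices with both $s$ and $e$ inclusive; that reading is an assumption chosen because it reproduces the caption's numbers.

```python
def pair_schedule(n_layers, s, e):
    """Group layers s..e (inclusive, an assumption) into consecutive pairs.

    Returns a list of tuples. Each tuple is one sequential step; a
    two-element tuple means those two layers are evaluated in parallel.
    """
    schedule = []
    i = 0
    while i < n_layers:
        if s <= i and i + 1 <= e:
            schedule.append((i, i + 1))  # two layers share one step
            i += 2
        else:
            schedule.append((i,))        # layer runs on its own
            i += 1
    return schedule

schedule = pair_schedule(n_layers=32, s=4, e=29)
print(len(schedule))  # 19 sequential steps
print(schedule[4])    # (4, 5)
```

Under this reading, layers $4$ to $29$ form 26 layers, hence 13 pairs, and the remaining 6 layers run alone, giving $13 + 6 = 19$ sequential steps.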