Leveraging the true depth of LLMs
Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
TL;DR
This work introduces Layer Parallelism (LP), a post-hoc method that parallelizes consecutive Transformer layers to reduce effective depth and inter-GPU communication during LLM inference, without retraining. By restructuring the computational graph so that adjacent layers run in parallel, LP increases throughput at the cost of modest losses in perplexity and downstream task performance, part of which can be recovered through lightweight fine-tuning. The authors report results across multiple models (e.g., Llama2-7B, Llama3-3B, Qwen-4B/14B), with speedups of up to roughly 1.46x on multi-GPU setups. They also give a theoretical analysis of the LP approximation error and discuss the method's limitations and trade-offs. Overall, LP is a practical way to accelerate LLM deployment; it also suggests that a model's effective depth is smaller than its nominal layer count and that LP can be combined with other efficiency techniques.
Abstract
The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense computational cost. While recent work has shown that many LLM layers can be reordered or even removed with minimal impact on accuracy, these insights have not been translated into significant inference speedups. To bridge this gap, we introduce a novel method that restructures the computational graph by grouping and evaluating consecutive layer pairs in parallel. This approach, requiring no retraining, yields a 1.19x throughput gain on Llama 2 7B while reducing the average benchmark accuracy by only 1.5%. We demonstrate the practical value of this method for large-scale LLM deployment and show that some of the lost accuracy can be recovered with lightweight fine-tuning of the parallelized layers.
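To make the idea of evaluating a layer pair in parallel concrete, the following toy sketch contrasts the usual sequential computation of two residual layers with one possible parallel form. It rests on assumptions that this section does not state: each layer is modeled as a residual update x + f(x), and "parallel" is taken to mean that both branches read the same input and their outputs are summed. The helper names (make_layer, sequential_pair, parallel_pair) and the tanh branch are hypothetical stand-ins, not the paper's implementation.

```python
import math

def make_layer(scale):
    """Toy residual branch f(x); a hypothetical stand-in for a Transformer block."""
    def f(x):
        return [scale * math.tanh(v) for v in x]
    return f

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [u + v for u, v in zip(a, b)]

def sequential_pair(x, f1, f2):
    # Standard two-layer computation: layer 2 reads layer 1's output.
    h = add(x, f1(x))
    return add(h, f2(h))

def parallel_pair(x, f1, f2):
    # Assumed parallel form: both branches read the same input x,
    # so f1(x) and f2(x) have no dependency and could run concurrently.
    return add(add(x, f1(x)), f2(x))

x = [0.5, -1.0, 2.0]
f1, f2 = make_layer(0.05), make_layer(0.05)

seq = sequential_pair(x, f1, f2)
par = parallel_pair(x, f1, f2)

# The gap is the approximation error of replacing f2(x + f1(x)) with f2(x).
max_gap = max(abs(a - b) for a, b in zip(seq, par))
print(max_gap)
```

In this toy setting the gap stays small because each branch changes its input only slightly, so f2 sees nearly the same input in both forms. Whether that holds for a given pair of real layers is an empirical question, which is what the reported accuracy and perplexity trade-offs measure.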
