Efficient Large Language Model Inference with Neural Block Linearization

Mete Erdogan, Francesco Tonin, Volkan Cevher

TL;DR

This work tackles the high inference cost of transformer-based LLMs by introducing Neural Block Linearization (NBL), which replaces selected self-attention layers with closed-form linear estimators derived from the Linear Minimum Mean Squared Error (LMMSE) criterion. A bound based on Canonical Correlation Analysis (CCA) quantifies the approximation error incurred by linearizing each layer and guides substitution: the layers with the smallest bound are replaced, so pre-trained models can be compressed without fine-tuning. Empirically, NBL yields notable speed-ups (e.g., a 32% increase in inference speed when 12 self-attention layers of DeepSeek-R1-Distill-Llama-8B are linearized, at under 1% accuracy cost) while maintaining competitive accuracy across multiple models and reasoning benchmarks, and it remains effective when combined with post-training quantization (AWQ) and speculative decoding. The approach offers a scalable, interpretable path toward deploying LLMs in resource-constrained environments, supported by ablations and by extensions to larger models and hardware-aware settings.

Abstract

The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs. The implementation is available at: https://github.com/LIONS-EPFL/NBL.
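The selection step described above (score each attention layer by a CCA-based quantity, then linearize the lowest-scoring layers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact bound of Theorem 3.2 is not reproduced on this page, so the score $\sum_i (1 - \rho_i^2)$ over the canonical correlations $\rho_i$ is an assumed stand-in, and the function names `canonical_correlations`, `linearization_score`, and `select_layers` are hypothetical.

```python
import numpy as np


def canonical_correlations(X, Y, eps=1e-6):
    """Canonical correlations between samples X (n, d_x) and Y (n, d_y).

    eps is a small ridge added to both covariances for numerical
    stability (an assumption of this sketch).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C_xx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])
    C_yy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    C_yx = Yc.T @ Xc / (n - 1)
    # Whiten with Cholesky factors: M = L_y^{-1} C_yx L_x^{-T}.
    L_x = np.linalg.cholesky(C_xx)
    L_y = np.linalg.cholesky(C_yy)
    M = np.linalg.solve(L_y, C_yx)
    M = np.linalg.solve(L_x, M.T).T
    # The singular values of the whitened cross-covariance are the
    # canonical correlations.
    rho = np.linalg.svd(M, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)


def linearization_score(X, Y):
    """Assumed proxy score: small when Y is close to a linear map of X."""
    rho = canonical_correlations(X, Y)
    return float(np.sum(1.0 - rho**2))


def select_layers(scores, m):
    """Return the m layer indices with the smallest score."""
    return sorted(scores, key=scores.get)[:m]
```

In use, `X` and `Y` would be the inputs and outputs of one attention layer collected on calibration data; layers whose outputs are nearly a linear function of their inputs score close to zero and are the first candidates for replacement.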

Paper Structure

This paper contains 50 sections, 2 theorems, 56 equations, 7 figures, 22 tables, 2 algorithms.

Key Result

Proposition 3.1

[kay1993fundamentals] The Linear Minimum Mean Squared Error (LMMSE) estimator defines the optimal linear relationship between the vector-valued random variables $X$ and $Y$ by the following weight $W$ and bias $b$:

$$W = C_{YX} C_{XX}^{-1}, \qquad b = \mathbb{E}[Y] - W\,\mathbb{E}[X],$$

where $C_{XX}$ is the covariance of $X$ and $C_{YX}$ is the cross-covariance between $Y$ and $X$.
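As an illustration of the proposition, the standard LMMSE closed form $W = C_{YX} C_{XX}^{-1}$, $b = \mathbb{E}[Y] - W\,\mathbb{E}[X]$ can be estimated from paired samples with NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the function name `fit_lmmse` and the `ridge` regularizer are hypothetical additions, and the sample means and covariances stand in for the population quantities.

```python
import numpy as np


def fit_lmmse(X, Y, ridge=1e-6):
    """Estimate the LMMSE weight and bias from samples.

    X: (n, d_in) inputs, Y: (n, d_out) outputs.
    Returns W with shape (d_out, d_in) and b with shape (d_out,),
    so that Y is approximated by X @ W.T + b.
    """
    n = X.shape[0]
    mu_x = X.mean(axis=0)
    mu_y = Y.mean(axis=0)
    Xc = X - mu_x
    Yc = Y - mu_y
    C_xx = Xc.T @ Xc / (n - 1)   # sample covariance of X
    C_yx = Yc.T @ Xc / (n - 1)   # sample cross-covariance of Y and X
    # ridge keeps the solve well-posed if C_xx is near-singular
    # (an assumption of this sketch, not part of the proposition).
    C_xx_reg = C_xx + ridge * np.eye(X.shape[1])
    # W = C_yx C_xx^{-1}; since C_xx is symmetric, solve C_xx W^T = C_yx^T.
    W = np.linalg.solve(C_xx_reg, C_yx.T).T
    b = mu_y - W @ mu_x
    return W, b
```

For an attention layer, `X` and `Y` would hold the layer's inputs and outputs flattened over tokens; the fitted `W` and `b` then define the linear layer that replaces it.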

Figures (7)

  • Figure 1: Illustration of Neural Block Linearization (NBL), which replaces a multi-head attention layer with an efficient linear layer using the closed-form LMMSE estimator.
  • Figure 2: Illustration of layer selection in the NBL method, guided by the CCA-based bound from Theorem 3.2, as applied to (a) Mistral-7B and (b) Llama-3.1-8B models.
  • Figure 3: Calibration runtime scaling with model size.
  • Figure 4: Prefill speed-up of Llama-3.1-8B with varying context lengths. NBL values are normalized by the baseline.
  • Figure 5: Speculative decoding + NBL speed-ups on DeepSeek-R1-Distill-Llama-8B (MT-bench [zheng2023judging], A100).
  • ...and 2 more figures

Theorems & Definitions (4)

  • Proposition 3.1
  • Theorem 3.2
  • proof
  • proof