CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers

Longwei Zou, Qingyang Wang, Han Zhao, Jiangang Kong, Yi Yang, Yangdong Deng

TL;DR

CQIL introduces Concurrent Computation of Quasi-Independent Layers to reduce LLM inference latency by parallelizing computations across layers with similar inputs. A bypassing mechanism transmits selective attention outputs to mitigate information loss, enabling up to 48.3% latency reduction on LLaMA-33B with minimal performance degradation. The approach is orthogonal to existing efficiency methods like pruning and tensor parallelism and shows greater benefits for larger models, suggesting a pipeline-ensemble shift in layer functionality. Practical impact includes faster online inference and potential for combining CQIL with other acceleration strategies in multi-GPU environments.

Abstract

Fast-growing large-scale language models are delivering unprecedented performance on almost all natural language processing tasks. However, the effectiveness of large language models relies on an exponentially increasing number of parameters. The overwhelming computational complexity incurs high inference latency that negatively affects user experience. Existing methods to improve inference efficiency, such as tensor parallelism and quantization, aim to reduce per-layer computing latency, yet overlook the cumulative latency due to the number of layers. Recent works that reduce the cumulative latency through layer removal, however, lead to a significant performance drop. Motivated by the similarity of inputs among adjacent layers, we propose to identify quasi-independent layers, which can be computed concurrently to significantly decrease inference latency. We also introduce a bypassing technique to mitigate the effect of information loss. Empirical experiments with the proposed approach on the LLaMA models confirm that Concurrent Computation of Quasi-Independent Layers (CQIL) can reduce latency by up to 48.3% on LLaMA-33B while maintaining a close level of performance.
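
The motivating observation, that adjacent layers receive similar inputs, can be illustrated with a small sketch. This is not the authors' code: it assumes cosine similarity as the metric, and it replaces real LLaMA hidden states with a hypothetical toy residual stack (`toy_layer_update` and `layer_inputs` are illustrative names) in which each layer adds a small update to its input.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_layer_update(x, rng, scale):
    """Hypothetical stand-in for a layer's residual update (attention + MLP).

    A real model would compute this from x; here it is small random noise,
    an assumption chosen only to mimic updates that are small relative to x.
    """
    return [scale * rng.gauss(0.0, 1.0) for _ in x]


def layer_inputs(x0, num_layers, rng, scale=0.1):
    """Collect the input seen by each layer of a toy residual stack."""
    inputs = [x0]
    x = x0
    for _ in range(num_layers - 1):
        delta = toy_layer_update(x, rng, scale)
        x = [a + d for a, d in zip(x, delta)]  # residual connection
        inputs.append(x)
    return inputs


def similarity_matrix(inputs):
    """Pairwise cosine similarity between the inputs of all layers."""
    return [[cosine(u, v) for v in inputs] for u in inputs]


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(64)]
inputs = layer_inputs(x0, num_layers=8, rng=rng)
sim = similarity_matrix(inputs)

# Similarity between the input of layer i and the input of layer i + 1.
adjacent = [sim[i][i + 1] for i in range(len(inputs) - 1)]
print([round(s, 3) for s in adjacent])
```

With a real model, `inputs` would instead hold the hidden states entering each decoder layer; layers whose inputs remain highly similar would be the candidates for concurrent computation.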

Paper Structure (33 sections, 5 equations, 8 figures, 10 tables, 1 algorithm)

Figures (8)

  • Figure 1: Similarity of inputs across layers in LLaMA-1 models. The sub-figure titled "2.7B" shows the similarity of inputs in Sheared-LLaMA-2.7B [sheared-llama]. It highlights that adjacent layers have highly similar inputs. Notably, this similarity becomes increasingly evident in larger models and at deeper layers, suggesting the quasi-independence of deeper layers and the potential for parallel computation.
  • Figure 2: Sensitivity of Input Substitution. We individually replace the input of layer $l$ with that of layer $l-k$ and evaluate the perplexity. A darker block indicates higher perplexity and diminished performance. When $k \ge l$, there is no corresponding layer $l-k$, so these parts are left blank in the figure. The original perplexity is around 6. The figure shows that both bottom and top layers (bottom refers to the direction close to the embedding layer) are sensitive to input substitution, whereas the majority of middle layers are relatively insensitive.
  • Figure 3: The proposed method. (a-c) depict pipeline inference as well as CQIL with and without the bypassing technique. Pipeline inference represents the standard setup, where layers are processed sequentially. In contrast, CQIL substitutes the input of layer $2$ with that of layer $1$, enabling concurrent computation across two GPUs for latency reduction. Given that both layers produce attention outputs concurrently, and the attention output of layer $1$ serves as an input for layer $2$ in the original model, the bypassing technique transmits the attention output from GPU1 to GPU2, minimizing the information loss and improving model performance.
  • Figure 4: Comparison with Tensor Parallelism. CQIL achieves consistent latency reduction on all batch sizes, benefiting the online inference.
  • Figure 5: MMLU (zero-shot) comparison with similarity-based pruning methods. The dashed gray line represents the random guessing score.
  • ...and 3 more figures
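
The three setups described in the Figure 3 caption (pipeline inference, CQIL without bypassing, and CQIL with bypassing) can be sketched on a toy two-layer model. This is one plausible reading of the caption, not the authors' implementation: the layers are hypothetical element-wise `tanh` maps rather than real attention and MLP blocks, normalization is omitted, and the assumption is that bypassing adds the attention output of layer 1 to the input of layer 2's MLP. The "GPU1"/"GPU2" comments only mark which work could run concurrently; everything executes on the CPU here.

```python
import math


def make_layer(attn_w, mlp_w):
    """Hypothetical toy layer; 'attn' and 'mlp' are element-wise tanh maps."""
    return {
        "attn": lambda h: [math.tanh(attn_w * a) for a in h],
        "mlp": lambda h: [math.tanh(mlp_w * a) for a in h],
    }


def add(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def sequential(x, layer1, layer2):
    """Pipeline inference: layer 2 consumes the full output of layer 1."""
    h1 = add(x, layer1["attn"](x))
    y1 = add(h1, layer1["mlp"](h1))
    h2 = add(y1, layer2["attn"](y1))
    return add(h2, layer2["mlp"](h2))


def concurrent(x, layer1, layer2, bypass):
    """CQIL-style computation: both layers read the same input x."""
    a1 = layer1["attn"](x)            # "GPU1"
    a2 = layer2["attn"](x)            # "GPU2", same input as layer 1
    m1 = layer1["mlp"](add(x, a1))    # "GPU1"
    # Assumed bypass: layer 1's attention output is sent to "GPU2" so that
    # layer 2's MLP also sees it, as it would in the sequential model.
    h2 = add(x, a1, a2) if bypass else add(x, a2)
    m2 = layer2["mlp"](h2)            # "GPU2"
    # Both residual updates are merged back into the shared stream.
    return add(x, a1, m1, a2, m2)


def max_abs_diff(u, v):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


x = [0.5, -1.0, 2.0]
layer1 = make_layer(0.5, 0.5)
layer2 = make_layer(0.3, 0.7)

reference = sequential(x, layer1, layer2)
no_bypass = concurrent(x, layer1, layer2, bypass=False)
with_bypass = concurrent(x, layer1, layer2, bypass=True)

print("deviation without bypass:", max_abs_diff(reference, no_bypass))
print("deviation with bypass:   ", max_abs_diff(reference, with_bypass))
```

In this toy, the concurrent result matches the sequential one exactly when layer 1 contributes nothing (both of its weights set to zero), which is the limiting case of the quasi-independence the paper relies on. The printed deviations are properties of the toy weights only and say nothing about the accuracy reported for LLaMA models.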