Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy

Yao Zhao, Zhitian Xie, Chen Liang, Chenyi Zhuang, Jinjie Gu

TL;DR

This work addresses production-scale LLM inference latency by identifying IO bandwidth as the primary bottleneck and proposing Lookahead, a training-free, model-free acceleration framework. Lookahead uses a trie-based retrieval to assemble multi-branch drafts and a parallel Verification and Accept (VA) process to ensure lossless generation, achieving longer effective decoding lengths without compromising accuracy. The approach leverages GPU FLOPs redundancy by increasing tokens-per-step, and introduces hierarchical multi-branch drafts to efficiently reuse shared prefixes. Extensive experiments on industry and open datasets show substantial speedups (up to ~6×) with modest memory overhead, and the framework has been deployed in Alipay since 2023 across multiple applications, demonstrating practical impact. The authors also provide open-source code to encourage adoption and adaptation to new LLMs.

Abstract

As Large Language Models (LLMs) have made significant advancements across various tasks, such as question answering, translation, text summarization, and dialogue systems, the need for accuracy in information becomes crucial, especially for serious financial products serving billions of users like Alipay. However, for a real-world product serving millions of users, the inference speed of LLMs becomes a critical factor compared to a mere experimental model. Hence, this paper presents a generic framework for accelerating the inference process, resulting in a substantial increase in speed and cost reduction for our LLM-based scenarios, with lossless generation accuracy. In the traditional inference process, each token is generated sequentially by the LLM, so the time consumed is proportional to the number of generated tokens. To improve on this, our framework, named lookahead, introduces a multi-branch strategy. Instead of generating a single token at a time, we propose a Trie-based retrieval and verification mechanism that can accept several tokens in a single forward step. Our strategy offers two distinct advantages: (1) it guarantees absolute correctness of the output, avoiding any approximation algorithms, and (2) the worst-case performance of our approach is equivalent to the conventional process. We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework. Our framework has been widely deployed in Alipay since April 2023 and obtains a remarkable 2.66x to 6.26x speedup. Our code is available at https://github.com/alipay/PainlessInferenceAcceleration.
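The retrieve-then-verify idea described in the abstract can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not the API of the linked repository: the names `DraftTrie`, `verify_and_accept`, `lookahead_decode`, and the stand-in "model" `toy_next_token` are hypothetical. The sketch also checks draft tokens one by one with repeated calls, whereas the framework is described as verifying all branches in a single forward step.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class DraftTrie:
    """Toy trie over token ids; every suffix of an inserted sequence is indexed."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: Sequence[int], max_depth: int = 8) -> None:
        # Index each suffix (capped at max_depth) so any token can serve as a lookup key.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + max_depth]:
                node = node.children.setdefault(tok, TrieNode())

    def retrieve(self, prefix: Sequence[int], max_branches: int = 4) -> List[List[int]]:
        # Walk down the prefix, then list root-to-leaf continuations as draft branches.
        node = self.root
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        branches: List[List[int]] = []

        def walk(n: TrieNode, path: List[int]) -> None:
            if len(branches) >= max_branches:
                return
            if not n.children:
                if path:
                    branches.append(path)
                return
            for tok, child in n.children.items():
                walk(child, path + [tok])

        walk(node, [])
        return branches


def verify_and_accept(context: List[int], branches: List[List[int]],
                      next_token: Callable[[List[int]], int]) -> List[int]:
    # Keep the longest draft prefix that matches what greedy decoding would produce.
    best: List[int] = []
    for branch in branches:
        accepted: List[int] = []
        for tok in branch:
            if next_token(context + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # The model's own next token is always appended, so each step yields at least
    # one token: with no usable draft this reduces to ordinary step-by-step decoding.
    return best + [next_token(context + best)]


def lookahead_decode(prompt: List[int], next_token: Callable[[List[int]], int],
                     trie: DraftTrie, max_new_tokens: int) -> Tuple[List[int], int]:
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        branches = trie.retrieve(out[-1:])  # use the last token as the lookup key
        out.extend(verify_and_accept(out, branches, next_token))
        steps += 1
    return out[len(prompt):len(prompt) + max_new_tokens], steps


def toy_next_token(context: List[int]) -> int:
    # Hypothetical deterministic "model": the next token is the last token plus one, mod 10.
    return (context[-1] + 1) % 10


trie = DraftTrie()
trie.insert([0, 1, 2, 3, 4, 5, 9, 9])  # a draft source that is only partly correct

generated, steps = lookahead_decode([0], toy_next_token, trie, max_new_tokens=8)

# Reference: plain one-token-per-step greedy decoding with the same toy model.
reference: List[int] = []
ctx = [0]
for _ in range(8):
    reference.append(toy_next_token(ctx + reference))
```

In this toy run the accepted tokens are identical to the one-token-at-a-time reference, because a draft token is kept only when it equals the token the model would have produced anyway; the incorrect tail of the draft (`9, 9`) is simply rejected. The number of decoding steps drops because several draft tokens are accepted at once when the trie holds a matching continuation.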

Paper Structure (29 sections, 2 equations, 7 figures, 12 tables, 1 algorithm)

Figures (7)

  • Figure 1: Impact of decoding length on the total time of a single LLM forward pass. Even though the forward FLOPs are linear in the decoding length, ...
  • Figure 2: Overview of draft retrieval and the Verification and Accept (VA) process using various strategies.
  • Figure 3: The input ids, position ids and causal masks for the forward pass using various strategies.
  • Figure 4: Impact of decoding length and branch length on LLM inference speed using various acceleration methods.
  • Figure 5: Impact of decoding length and branch length on the effective decoding length (EDL) using various acceleration methods.
  • ...and 2 more figures