Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Yao Zhao, Zhitian Xie, Chen Liang, Chenyi Zhuang, Jinjie Gu
TL;DR
This work addresses production-scale LLM inference latency by identifying IO bandwidth as the primary bottleneck and proposing Lookahead, a training-free, model-free acceleration framework. Lookahead uses trie-based retrieval to assemble multi-branch drafts and a parallel Verification and Accept (VA) process to ensure lossless generation, so that more tokens are accepted per decoding step without compromising accuracy. The approach exploits otherwise idle GPU FLOPs by processing more tokens per step, and introduces hierarchical multi-branch drafts to reuse shared prefixes efficiently. Extensive experiments on industry and open datasets show substantial speedups (up to ~6×) with modest memory overhead, and the framework has been deployed in Alipay since 2023 across multiple applications. The authors also provide open-source code to encourage adoption and adaptation to new LLMs.
Abstract
As Large Language Models (LLMs) have made significant advancements across various tasks, such as question answering, translation, text summarization, and dialogue systems, the accuracy of the information they produce becomes crucial, especially for serious financial products serving billions of users like Alipay. However, for a real-world product serving millions of users, the inference speed of LLMs is a far more critical factor than it is for a purely experimental model. Hence, this paper presents a generic framework for accelerating the inference process, resulting in a substantial increase in speed and a reduction in cost for our LLM-based scenarios, with lossless generation accuracy. In the traditional inference process, each token is generated sequentially by the LLM, so the time consumed is proportional to the number of generated tokens. To improve on this, our framework, named lookahead, introduces a multi-branch strategy. Instead of generating a single token at a time, we propose a Trie-based retrieval and verification mechanism that can accept several tokens in a single forward step. Our strategy offers two distinct advantages: (1) it guarantees absolute correctness of the output, avoiding any approximation algorithms, and (2) the worst-case performance of our approach is equivalent to that of the conventional process. We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework. Our framework has been widely deployed in Alipay since April 2023 and achieves a 2.66x to 6.26x speedup. Our code is available at https://github.com/alipay/PainlessInferenceAcceleration.
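
The Trie-based retrieval and verification idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the names `TokenTrie`, `verify_accept`, and `lookahead_generate` are hypothetical, the "model" is any deterministic `next_token` function standing in for greedy LLM decoding, and verification is simulated token by token rather than in one batched forward pass as a real system would do it. The sketch only shows why the output matches sequential decoding and why the worst case is one token per step.

```python
from typing import Callable, Dict, List, Sequence


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class TokenTrie:
    """Hypothetical trie over token ids, used only to propose draft branches."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def index(self, tokens: Sequence[int], window: int) -> None:
        # Insert every window of the sequence so any token can act as a lookup prefix.
        for i in range(len(tokens)):
            self.insert(tokens[i:i + window])

    def retrieve(self, prefix: Sequence[int], max_len: int) -> List[List[int]]:
        """Return all branches (at most max_len tokens each) that follow `prefix`."""
        node = self.root
        for tok in prefix:
            node = node.children.get(tok)
            if node is None:
                return []
        branches: List[List[int]] = []

        def walk(n: TrieNode, path: List[int]) -> None:
            if not n.children or len(path) == max_len:
                if path:
                    branches.append(path)
                return
            for tok, child in n.children.items():
                walk(child, path + [tok])

        walk(node, [])
        return branches


def verify_accept(context: List[int], branches: List[List[int]],
                  next_token: Callable[[List[int]], int]) -> List[int]:
    """Keep the longest draft prefix the model agrees with, plus one model token."""
    best: List[int] = []
    for branch in branches:
        accepted: List[int] = []
        for tok in branch:
            if tok != next_token(context + accepted):
                break  # first disagreement ends this branch
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # The model's own next token is always appended, so each step yields >= 1 token.
    return best + [next_token(context + best)]


def lookahead_generate(prompt: List[int], trie: TokenTrie,
                       next_token: Callable[[List[int]], int], max_new: int,
                       prefix_len: int = 1, draft_len: int = 4):
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        branches = trie.retrieve(out[-prefix_len:], draft_len)
        out.extend(verify_accept(out, branches, next_token))
        steps += 1
    return out[:len(prompt) + max_new], steps
```

With an empty trie no drafts are retrieved, so each step accepts exactly the one token the model would have produced anyway, which mirrors the worst-case claim above. When the trie holds sequences that resemble what the model generates, several tokens are accepted per step while the final output stays identical to plain greedy decoding.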
