BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao
TL;DR
This work tackles the latency bottleneck of autoregressive large language models by enabling lossless semi-autoregressive (SAR) decoding through Bi-directional Tuning (BiTA), which adds a small set of trainable prompt and mask embeddings to a frozen model. A tree-based decoding mechanism generates and verifies draft candidates within a single forward pass, removing the need for an external verification model, and under greedy sampling the output is identical to that of standard autoregressive decoding. BiTA demonstrates 2.1x–3.3x speedups across multiple models and tasks, including 2.7x on MT-Bench for LLaMA-2-70B-Chat, with only ~0.01%–0.06% of parameters trained. The approach is plug-and-play, hardware-efficient, and surpasses state-of-the-art speculative decoding methods, offering a practical path to faster real-time LLM inference on edge and data-center deployments.
Abstract
Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), a method that accelerates LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by prompt tuning, we equip LLMs with a parameter-efficient design called bi-directional tuning, which gives them the capability of semi-autoregressive generation. Using efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. With BiTA applied, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm that our method surpasses state-of-the-art acceleration techniques.
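
To illustrate why draft verification can be lossless under greedy sampling, the following is a minimal sketch, not the BiTA implementation. The names `toy_next_token`, `noisy_drafter`, `verify_draft`, and `draft_and_verify_decode` are hypothetical, the "model" is a deterministic toy function rather than an LLM, and the draft is a single linear sequence rather than a tree. A real system would score all draft positions in one forward pass, whereas this sketch checks them one by one for clarity.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Toy stand-in for greedy next-token prediction (assumption, not an LLM)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def autoregressive_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one token per model step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out[len(prompt):]


def noisy_drafter(prefix: Sequence[int], k: int = 3) -> List[int]:
    """Hypothetical drafter: proposes k tokens, sometimes with a wrong last token."""
    ctx, draft = list(prefix), []
    for _ in range(k):
        tok = toy_next_token(ctx)
        draft.append(tok)
        ctx.append(tok)
    if len(prefix) % 2 == 0:
        draft[-1] = (draft[-1] + 1) % 50  # deliberately corrupt the last draft token
    return draft


def verify_draft(next_token: NextToken, prefix: Sequence[int], draft: List[int]) -> List[int]:
    """Keep draft tokens while they match the greedy choice; on the first
    mismatch emit the greedy token instead. If all match, emit one extra token."""
    ctx, accepted = list(prefix), []
    for tok in draft:
        target = next_token(ctx)
        if tok != target:
            accepted.append(target)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(next_token(ctx))
    return accepted


def draft_and_verify_decode(
    next_token: NextToken, prompt: List[int], n_new: int
) -> Tuple[List[int], int]:
    """Decode n_new tokens, counting draft-and-verify steps."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify_draft(next_token, out, noisy_drafter(out)))
        steps += 1
    return out[len(prompt):][:n_new], steps


baseline = autoregressive_decode(toy_next_token, [1, 2, 3], 20)
fast, steps = draft_and_verify_decode(toy_next_token, [1, 2, 3], 20)
print(fast == baseline, steps)
```

Because every emitted token is either a draft token that equals the greedy choice or the greedy choice itself, the output matches the autoregressive baseline regardless of draft quality; draft quality only affects how many steps are needed.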
