BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao

TL;DR

This work tackles the latency bottleneck of autoregressive large language models by enabling lossless semi-autoregressive (SAR) decoding through Bi-directional Tuning (BiTA), which adds a tiny set of trainable prompt and mask embeddings to frozen models. A tree-based decoding mechanism allows simultaneous draft generation and verification within a single forward pass, removing the need for an additional assistance model. BiTA demonstrates 2.1x–3.3x speedups across multiple models and tasks, including 2.7x on MT-Bench for LLaMA-2-70B-Chat, with only ~0.01%–0.06% of parameters trained. The approach is plug-and-play, hardware-efficient, and surpasses state-of-the-art speculative decoding methods, offering a practical path to faster real-time LLM inference on edge and data-center deployments.

Abstract

Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning for the capability in semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm our method surpasses state-of-the-art acceleration techniques.
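The lossless property under greedy sampling can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `verify_greedy` and its list-based inputs are hypothetical, and it assumes that a forward pass has already produced the model's greedy (argmax) token at every draft position. Drafted tokens are accepted only while they match what plain autoregressive decoding would have chosen, so the emitted tokens are the same as in the autoregressive case.

```python
from typing import List


def verify_greedy(draft: List[int], greedy: List[int]) -> List[int]:
    """Return the tokens emitted after verifying one draft (illustrative only).

    draft[i]  : the i-th drafted token.
    greedy[i] : the model's argmax token given the context plus draft[:i].
                It has one more entry than `draft`, because the position
                after the last drafted token is also predicted.
    """
    if len(greedy) != len(draft) + 1:
        raise ValueError("greedy must have exactly one more entry than draft")

    accepted = 0
    for drafted, expected in zip(draft, greedy):
        if drafted != expected:
            break  # first mismatch: this draft token and all later ones are rejected
        accepted += 1

    # Accepted drafts equal greedy[:accepted]; greedy[accepted] is the model's
    # own next token, which is always valid because its context was verified.
    # Predictions after a mismatch were conditioned on a wrong token and are dropped.
    return greedy[: accepted + 1]


if __name__ == "__main__":
    print(verify_greedy([5, 6, 7], [5, 6, 7, 8]))  # all three drafts accepted
    print(verify_greedy([5, 9, 7], [5, 6, 7, 8]))  # second draft rejected
```

Under this rule a forward pass always yields at least one token, and at most one more than the number of drafted tokens.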

Paper Structure

This paper contains 28 sections, 1 equation, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: A comparison of LLM acceleration techniques on MT-Bench using various base models, covering both state-of-the-art methods and our approach. The speedup numbers are either taken from the respective papers or, where a paper does not report them, reproduced by us using the released source code in a standardized hardware environment.
  • Figure 2: A diagram of bi-directional tuning, orange blocks [M] for trainable mask tokens, purple blocks [P] for trainable prompt tokens, and blue blocks for transformer layers in frozen LLM. The predicted SAR future tokens are generated with the joint influence of frozen LLM parameters, prompt tokens, and mask tokens. For illustration purposes, we set the count of prompt and mask tokens to be 3.
  • Figure 3: An illustrative example of the attention mask employed in bi-directional tuning. A "1" indicates activation, while a blank signifies suppression in the attention mechanism. The example shown is derived from the sentence in Figure 2.
  • Figure 4: A simple example illustrating the streamlined generation and verification. The input query, namely "<s> Have you heard about LLMs?", serves as the initial input token sequence $X^{0}$. The draft token candidates $\hat{C}^{i}$ are enclosed in orange dashed boxes. A successful acceptance is marked by a "check"; otherwise, it is marked by a "cross". If $c$ draft tokens are accepted, the first $c$ mask tokens are discarded as they are no longer necessary. If a draft candidate is rejected, its prediction, along with its subsequent tokens, is discarded (denoted as "/"). In these four forward passes, the model produces 1, 4, 1, and 3 output tokens, respectively.
  • Figure 5: The efficient draft candidate token tree. The configuration includes 4 mask tokens, and for the prediction of each mask token, the top 3 draft candidates are selected. As shown, only the top-1 scoring word has subsequent words for verification.
  • ...and 5 more figures
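
The tree shape described in the Figure 5 caption can be sketched as follows. This is one reading of the caption, not the authors' code: the function `build_draft_tree`, the representation of candidates as lists of strings, and the placeholder token names are all assumptions made for illustration. Each mask position contributes its top-k candidates, and only the top-1 candidate at each position is extended with candidates from the next position.

```python
from typing import List, Tuple


def build_draft_tree(topk: List[List[str]]) -> List[Tuple[str, ...]]:
    """Enumerate every node of the draft tree as the path leading to it.

    topk[d] holds the candidates for mask position d, best first.
    Only the best candidate at each position gets children, so the
    tree has len(topk) * k nodes rather than k ** len(topk).
    """
    paths: List[Tuple[str, ...]] = []
    prefix: Tuple[str, ...] = ()  # the chain of top-1 tokens chosen so far
    for candidates in topk:
        if not candidates:
            raise ValueError("every mask position needs at least one candidate")
        for token in candidates:
            paths.append(prefix + (token,))
        prefix = prefix + (candidates[0],)  # descend along the top-1 branch only
    return paths


if __name__ == "__main__":
    # Hypothetical candidates: 4 mask tokens, top 3 candidates each, as in the caption.
    topk = [[f"{pos}{rank}" for rank in (1, 2, 3)] for pos in "abcd"]
    for path in build_draft_tree(topk):
        print(" -> ".join(path))
```

With 4 mask tokens and 3 candidates per position, this layout gives 12 nodes to verify, compared with 3 + 9 + 27 + 81 = 120 nodes if every candidate were expanded.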