Table of Contents
Fetching ...

Amphista: Bi-directional Multi-head Decoding for Accelerating LLM Inference

Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Guanchen Li, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum

TL;DR

Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism in inference and results in significantly slow inference speed, so Amphista, an enhanced speculative decoding framework that builds upon Medusa is introduced.

Abstract

Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism in inference and results in significantly slow inference speed. While methods such as Medusa constructs parallelized heads, they lack adequate information interaction across different prediction positions. To overcome this limitation, we introduce Amphista, an enhanced speculative decoding framework that builds upon Medusa. Specifically, Amphista models an Auto-embedding Block capable of parallel inference, incorporating bi-directional attention to enable interaction between different drafting heads. Additionally, Amphista integrates Staged Adaptation Layers, which ensure a seamless transition of semantic information from the target model's autoregressive inference to the drafting heads' non-autoregressive inference, effectively achieving paradigm shift and feature fusion. Experimental results on Vicuna models using MT-Bench and Spec-Bench demonstrate that Amphista achieves substantial acceleration while maintaining generation quality. On MT-Bench, Amphista delivers up to 2.75$\times$ speedup over vanilla autoregressive decoding and 1.40$\times$ over Medusa on Vicuna 33B in wall-clock time.

Amphista: Bi-directional Multi-head Decoding for Accelerating LLM Inference

TL;DR

Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism in inference and results in significantly slow inference speed, so Amphista, an enhanced speculative decoding framework that builds upon Medusa is introduced.

Abstract

Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism in inference and results in significantly slow inference speed. While methods such as Medusa constructs parallelized heads, they lack adequate information interaction across different prediction positions. To overcome this limitation, we introduce Amphista, an enhanced speculative decoding framework that builds upon Medusa. Specifically, Amphista models an Auto-embedding Block capable of parallel inference, incorporating bi-directional attention to enable interaction between different drafting heads. Additionally, Amphista integrates Staged Adaptation Layers, which ensure a seamless transition of semantic information from the target model's autoregressive inference to the drafting heads' non-autoregressive inference, effectively achieving paradigm shift and feature fusion. Experimental results on Vicuna models using MT-Bench and Spec-Bench demonstrate that Amphista achieves substantial acceleration while maintaining generation quality. On MT-Bench, Amphista delivers up to 2.75 speedup over vanilla autoregressive decoding and 1.40 over Medusa on Vicuna 33B in wall-clock time.
Paper Structure (22 sections, 9 equations, 7 figures, 10 tables)

This paper contains 22 sections, 9 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Top-1/5 accuracy for different heads of Medusa and Amphista. We perform testing with randomly sampled 5% ShareGPT conversation data. Amphista far outperforms Medusa in terms of head accuracy, especially for the latter two heads.
  • Figure 2: The Framework of Amphista Decoding. Our method improve Medusa in two folds: (1) We introduce staged adaptation layers, consisting of a group of causal Transformer Decoder layers built upon the target model, to adapt the target model's hidden states and the sampled token in two stages. This module ensures that the adapted features contain richer contextual information, supporting multiple-token predictions rather than focusing solely on the immediate next-token prediction. (2) We introduce an auto-embedding block, which is a bi-directional Transformer Encoder module with positional encoding. This block allows each head to attend to others, fostering cooperative predictions and thereby enhancing the speculative accuracy during the drafting stage.
  • Figure 3: Throughput (tokens/s) on MT-Bench with different target model sizes and temperatures.
  • Figure 4: Draft tree used in Medusa, Hydra and our Amphista.
  • Figure 5: An Illustration of Tree Attention. Assuming Medusa has only 2 heads, where head-1 generates the top-2 tokens and head-2 generates the top-3 tokens, resulting in 6 candidate sequences (e.g., ABD). Additionally, a special tree mask is designed to ensure causal relationships among the top-k nodes of each head.
  • ...and 2 more figures