Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin

TL;DR

The paper tackles latency in autoregressive LLM decoding by enabling learned asynchronous decoding through semantic independence. It introduces Pasta-Lang, an annotation language, and a decoding-time interpreter, paired with a two-stage finetuning pipeline and BoNBoN-style optimization to maximize speed and quality. On AlpacaEval, Pasta achieves Pareto-dominant speedups (1.21x–1.93x) with controlled quality trade-offs, outperforming hand-crafted approaches like APAR and SoT. The work analyzes design choices such as position ID prediction and scoring metrics, showing scalable improvements without observed saturation across iterative refinements.

Abstract

Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on the fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against a sequential decoding baseline.
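The two quantities the abstract reports, a geometric mean speedup and Pareto dominance over (speedup, quality) pairs, can be illustrated with a minimal Python sketch. The latency values and the function names `geometric_mean_speedup` and `dominates` are hypothetical and chosen for illustration only; how the paper aggregates per-prompt measurements is not stated here, so the per-prompt latency ratio is an assumption.

```python
import math

def geometric_mean_speedup(baseline_latencies, parallel_latencies):
    """Geometric mean of per-prompt speedups (baseline latency / parallel latency)."""
    ratios = [b / p for b, p in zip(baseline_latencies, parallel_latencies)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def dominates(a, b):
    """True if point a Pareto-dominates point b.

    Points are (speedup, quality) pairs where higher is better on both axes:
    a must be at least as good on both and strictly better on at least one.
    """
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

# Made-up latencies in seconds for three prompts (not from the paper).
baseline = [4.0, 9.0, 2.0]
parallel = [2.0, 9.0, 1.0]
speedup = geometric_mean_speedup(baseline, parallel)  # ratios are 2, 1, 2

# Made-up (speedup, quality change) points for two hypothetical methods.
faster_and_better = dominates((1.5, 0.0), (1.2, -1.0))
trade_off_only = dominates((1.9, -7.0), (1.2, 2.0))
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a 2x speedup on one prompt and a 0.5x slowdown on another then average to 1x rather than 1.25x.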

Paper Structure (17 sections, 1 equation, 9 figures)

Figures (9)

  • Figure 1: Example response from a Pasta model executed by the Pasta-Lang interpreter. The interpreter begins with only the main thread. It first decodes 1, then creates an asynchronous decoding thread, which decodes 4 (in red). In parallel, the main thread decodes 2. It then creates another asynchronous decoding thread, whose prefix contains both the <promise/> tag on coordinates extraction and the <promise/> tag on length formula, and which decodes 5 (in green). The main thread continues decoding in parallel with both threads to produce 3. It waits at this point until all other threads complete. The interpreter then inserts each piece of asynchronously decoded content after its corresponding <promise/> tag. Finally, the interpreter decodes 6, with both pieces of asynchronously decoded content in the prefix.
  • Figure 2: Details for efficient Pasta-Lang interpreter implementation. Color shows the identity of the decoding thread (purple=main, red=Fork#1, green=Fork#2); orange denotes interpreter-inserted tokens.
  • Figure 3: Pasta-Lang dataset creation and model training.
  • Figure 4: Left (Realized Speedup). Pasta models achieve a Pareto-optimal quality-speedup trade-off compared with asynchronous decoding strategies that use hand-crafted heuristics. Middle (Theoretical Speedup). The realized speedup using the Pasta-Lang interpreter is close to the theoretical speedup. Right (Theoretical Parallelism). Pasta responses show a high degree of parallelism.
  • Figure 5: Left (Scalability). As we continue investing training compute by increasing the number of rounds of preference optimization, the quality-latency trade-off continues to improve. Middle (Positional Embedding). Analysis of different methods for computing position IDs during decoding. LLM-based prediction of the position IDs in multiples of ten (Pred-10x) achieves the highest quality without significantly sacrificing speedup. Right (Preference Score). Analysis of different metrics for decoding efficiency used in calculating the Pasta-Lang preference scores. Optimizing for the theoretical speedup achieves both high theoretical speedup and LC win rate.
  • ...and 4 more figures
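
The control flow described in the Figure 1 caption (the main thread emits <promise/> tags, asynchronous threads decode the promised content in parallel, the main thread waits, and the interpreter inserts each result after its tag) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the paper's interpreter: `toy_decode` stands in for an LLM decoding thread, the `SYNC` marker and the `("promise", topic)` step encoding are hypothetical, and the sketch ignores prefix sharing and the efficiency details shown in Figure 2.

```python
from concurrent.futures import ThreadPoolExecutor

SYNC = object()  # hypothetical marker: main thread waits here for all async threads

def toy_decode(topic):
    # Stand-in for an asynchronous decoding thread; a real system would run the LLM.
    return f"[{topic}]"

def interpret(main_stream, decode=toy_decode):
    """Run a toy main thread over main_stream and return the assembled response."""
    out = []      # response pieces in final order
    pending = []  # (slot index in out, future) for each unfulfilled promise

    def fulfill():
        # Block until every async thread finishes, then fill the reserved slots.
        for slot, future in pending:
            out[slot] = future.result()
        pending.clear()

    with ThreadPoolExecutor() as pool:
        for step in main_stream:
            if step is SYNC:
                fulfill()
            elif isinstance(step, tuple) and step[0] == "promise":
                out.append("<promise/>")
                out.append(None)  # reserved slot directly after the tag
                pending.append((len(out) - 1, pool.submit(decode, step[1])))
            else:
                out.append(step)  # text decoded by the main thread itself
        fulfill()  # resolve any promises left if the stream had no final SYNC
    return " ".join(out)

response = interpret([
    "Intro.",
    ("promise", "coordinates"),
    ("promise", "length formula"),
    "Bridge.",
    SYNC,
    "Conclusion.",
])
print(response)
```

Reserving a slot immediately after each tag keeps the insertion order independent of which asynchronous thread finishes first, which mirrors the caption's statement that content is inserted after its corresponding <promise/> tag.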