
MARS: Enabling Autoregressive Models Multi-Token Generation

Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun

Abstract

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS makes no architectural modifications, adds no extra parameters, and produces a single model that can still be invoked exactly like the original AR model, with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while improving throughput by 1.5-1.7x. We further develop a block-level KV caching strategy for batch inference, achieving up to a 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
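The confidence-threshold decoding described above can be made concrete with a small sketch. The snippet below is a minimal illustration, assuming a Hugging Face-style `model(input_ids).logits` interface; the function name `mars_step`, the greedy acceptance rule, and the tensor shapes are my assumptions, not the paper's released code.

```python
import torch

@torch.no_grad()
def mars_step(model, prefix_ids, mask_id, block=4, tau=0.9):
    """One forward pass: append `block` [MASK] tokens after the prefix,
    then accept the longest confident run of predictions (at least one)."""
    masks = torch.full((1, block), mask_id,
                       dtype=prefix_ids.dtype, device=prefix_ids.device)
    logits = model(torch.cat([prefix_ids, masks], dim=1)).logits
    # The logits at the masked positions fill those positions in parallel.
    probs = logits[:, -block:, :].softmax(dim=-1)
    conf, tokens = probs.max(dim=-1)   # greedy tokens + their probabilities
    accept = 1                         # the first token is always accepted
    while accept < block and conf[0, accept] >= tau:
        accept += 1                    # keep accepting while confident
    return tokens[:, :accept]          # 1..block new tokens this step

# Raising tau toward 1.0 recovers one-token-per-pass AR decoding; lowering
# it accepts more tokens per pass. This is the runtime speed knob the
# abstract describes.
```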

Paper Structure

This paper contains 24 sections, 6 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Example MARS generation on GSM8K. The model adaptively generates 1--4 tokens per forward pass based on confidence: predictable continuations are batched together, while novel content proceeds token-by-token. This achieves 2.55$\times$ tokens per forward pass over standard AR decoding.
  • Figure 2: MARS attention mask and inference for $L{=}8$, $B{=}4$. Left: training mask with $[\mathbf{x} \mid \tilde{\mathbf{x}}]$ concatenation. The orange cells show that noisy positions attend to each other causally within each block, in contrast to Block Diffusion (Arriola et al.), which uses bidirectional attention within blocks. Right: sliding-window inference. The dashed line marks the generation cursor; $B$ [MASK] tokens are appended and filled via one forward pass. Accepted tokens (blue) slide into the prefix for the next step.
  • Figure 3: Speed--quality Pareto curves on GSM8K (left) and HumanEval (right). Solid lines: MARS (with SFT loss). Dashed lines: w/o SFT loss. Dotted: AR SFT baseline. With SFT loss, MARS dominates at every operating point on both tasks.
  • Figure 4: Block-level KV cache for batch inference ($B_{\text{cache}}{=}4$, batch size 3). Each step forwards $B$ [MASK] tokens against the cached prefix. The cache advances by the minimum number of tokens accepted across all samples: after Step 1 (S1 accepts 4, S2 accepts 2, S3 accepts 1), one token is cached (green). S1 idles while S2 and S3 continue. Once all samples fill the block, the entire block is cached (yellow) and new [MASK] tokens are appended for the next block (a sketch of this cache-advance rule follows this list).
  • Figure 5: Speed--quality trade-off under three acceptance metrics (MARS $B{=}4$, GSM8K). All three metrics trace similar Pareto frontiers, indicating that the speed--quality trade-off is robust to the choice of acceptance criterion. Entropy and top-2 margin degrade slightly more gracefully than raw probability at comparable tokens per forward pass (the three metrics are sketched after this list).
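Figure 4's cache-advance rule is simple enough to state in a few lines. The sketch below tracks only the per-sample bookkeeping, under the assumption that the KV tensors themselves are sliced elsewhere in the serving stack; the function and variable names are illustrative, not from the paper.

```python
def cache_advance(filled, accepted, block=4):
    """filled[i]: tokens sample i has accepted within the current block.
    The shared KV cache may only hold the prefix common to every sample,
    so it grows by the minimum progress across the batch."""
    cached_before = min(filled)
    filled = [min(block, f + a) for f, a in zip(filled, accepted)]
    newly_cached = min(filled) - cached_before
    if min(filled) == block:           # every sample filled the block:
        filled = [0] * len(filled)     # cache the whole block, start the next
    return newly_cached, filled

# Caption example (block = 4): from filled = [0, 0, 0], Step 1 accepts
# [4, 2, 1] -> newly_cached = 1 (the green token); S1 is full and idles
# while S2 and S3 keep decoding against the partially cached block.
```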
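For Figure 5, the three acceptance metrics are standard confidence measures over the per-position softmax. A minimal sketch (the function name and threshold convention are mine, not the paper's):

```python
import torch

def acceptance_score(probs: torch.Tensor, metric: str = "prob") -> torch.Tensor:
    """probs: softmax over the vocabulary at one position.
    Returns a score where higher means more confident; a predicted token
    is accepted when its score clears a chosen threshold."""
    if metric == "prob":      # raw probability of the top-1 token
        return probs.max(dim=-1).values
    if metric == "margin":    # top-2 margin: p1 - p2
        top2 = probs.topk(2, dim=-1).values
        return top2[..., 0] - top2[..., 1]
    if metric == "entropy":   # negated entropy: closer to 0 = more confident
        return (probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    raise ValueError(f"unknown metric: {metric}")
```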