CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality

Razvan-Gabriel Dumitru, Minglai Yang, Vikas Yadav, Mihai Surdeanu

TL;DR

CopySpec introduces a rolling-hash-based copying mechanism: it matches the last $\gamma$ generated tokens against the context and, on a match, speculatively copies the tokens that followed, accelerating LLM inference without extra GPU memory. When integrated with speculative decoding, CopySpec can either supply copied blocks or fall back on a draft model, yielding additional throughput gains while preserving output quality. The method is validated across seven LLMs and five datasets, including a new MT-Redundant benchmark in which the second turn asks for a variation of the earlier answer, and it shows substantial speedups, especially in multi-turn, redundancy-rich tasks. The approach is lightweight, modular, and complementary to external drafting frameworks, offering practical improvements for real-time AI systems and multi-turn conversations.
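The rolling-hash lookup mentioned in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_index` and `propose_copy`, the constants `BASE` and `MOD`, and the choice to rebuild the index on every call are all assumptions made for brevity (a practical system would presumably update the index incrementally as tokens are generated). Tokens are represented as integer IDs.

```python
from collections import defaultdict

# Hypothetical hash parameters, chosen only for illustration.
BASE = 1_000_003
MOD = (1 << 61) - 1


def build_index(tokens, gamma):
    """Map the rolling hash of every gamma-token window to the positions
    immediately after that window (where a copy would start)."""
    index = defaultdict(list)
    if len(tokens) < gamma:
        return index
    top = pow(BASE, gamma - 1, MOD)  # weight of the oldest token in a window
    h = 0
    for t in tokens[:gamma]:
        h = (h * BASE + t) % MOD
    index[h].append(gamma)
    for i in range(gamma, len(tokens)):
        # Slide the window: drop tokens[i - gamma], append tokens[i].
        h = ((h - tokens[i - gamma] * top) * BASE + tokens[i]) % MOD
        index[h].append(i + 1)
    return index


def propose_copy(tokens, gamma, block_len):
    """Return up to block_len tokens that followed the most recent earlier
    occurrence of the last gamma tokens, or [] if there is none."""
    n = len(tokens)
    if n <= gamma:
        return []
    index = build_index(tokens, gamma)
    suffix = tokens[n - gamma:]
    h = 0
    for t in suffix:
        h = (h * BASE + t) % MOD
    for end in reversed(index.get(h, [])):
        # Skip the suffix itself (end == n) and guard against hash collisions.
        if end < n and tokens[end - gamma:end] == suffix:
            return tokens[end:end + block_len]
    return []


history = [5, 6, 7, 8, 9, 1, 2, 5, 6, 7]
candidate = propose_copy(history, gamma=3, block_len=4)
```

In this toy run the last three tokens `[5, 6, 7]` also appear at the start of `history`, so the proposed block is the four tokens that followed them there. The proposal is only a guess; the model still has to verify it before any token is kept.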

Abstract

We introduce CopySpec, a simple yet effective technique to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs or responses that can be verbatim extracted from context. CopySpec identifies repeated sequences in the model's chat history or context and speculates that the same tokens will follow, enabling seamless copying without compromising output quality and without requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using seven LLMs and five datasets: MT-Bench, CNN/DM, GSM8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn's answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select MT-Redundant categories, and 2.66x on the third turn of GSM8K's self-correction tasks. Importantly, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context size grows, CopySpec leverages larger contexts to accelerate inference, making it a faster complementary solution. Our code and dataset are publicly available at https://github.com/RazvanDu/CopySpec.

Paper Structure

This paper contains 51 sections, 3 equations, 18 figures, 22 tables.

Figures (18)

  • Figure 1: An example of redundant information, represented by blocks of the same color, that can be directly copied during inference without re-computation. This highlights the potential of our approach to make inference more efficient by leveraging repeated information, reducing computational overhead, and improving speed.
  • Figure 2: The speculative copying process of CopySpec, applied to extracting the habitat description of the "wood duck." The input text provides the context and instructions. During generation, the system identifies sequences of 3 consecutive tokens (we use words as tokens here for illustrative simplicity) that repeat within the input. The blue rectangle in the input highlights the detected matching token sequence, which serves as the starting point for speculative copying. From this match, the next 10 tokens are copied into the output. In the output, the copied tokens are shown in blue and validated through speculative copying. Tokens accepted by the model are highlighted in green, continuing the description seamlessly, while rejected tokens are shown in red with a strikethrough. Extra tokens generated during the validation process are marked in yellow/gold, demonstrating how the model extends the copied content as needed. The figure demonstrates how CopySpec leverages repeated sequences to speed up text generation by combining copied and dynamically generated content.
  • Figure 3: This figure shows how the copying parameter $\gamma$ affects HumanEval performance using LLaMa3.1-8B-Instruct. The solid red line indicates tokens per second (TPS) with standard deviation shading; the dashed red line marks baseline TPS. The blue line shows the percentage of successfully copied tokens, with adjacent numbers indicating copying attempts.
  • Figure 4: Average accepted tokens per copy attempt against $\gamma$ using LLaMa-8B ($|S_{\text{copyspec}}| = 10$), showcasing the correlation between $\gamma$ and the accepted tokens.
  • Figure 5: Cosine similarity and tokens per second as a function of $\gamma$, using Qwen2.5-7B on the MT-Bench and MT-Redundant datasets. The blue line indicates the cosine similarity, showing semantic alignment across varying $\gamma$-token contexts. The red line shows tokens per second, reflecting generation speed. $\gamma$ denotes the number of tokens considered in the context for each measurement. The left plot shows MT-Bench, and the right plot shows MT-Redundant.
  • ...and 13 more figures
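
The accept/reject behaviour described in the Figure 2 caption (accepted tokens, a rejected remainder, and an extra model-generated token) can be sketched as follows, assuming greedy decoding. The names `verify_copy`, `toy_next_token`, and `TARGET` are hypothetical, and the toy "model" simply recites a made-up sentence. A real system would presumably score the whole copied block in one batched forward pass; the token-by-token loop here is only for readability.

```python
def verify_copy(context, draft, next_token_fn):
    """Return the tokens to append: the longest prefix of `draft` that the
    model itself would have produced, plus one token from the model."""
    accepted = []
    for tok in draft:
        predicted = next_token_fn(context + accepted)
        if predicted != tok:
            # First mismatch: drop the rest of the draft and keep the
            # model's own token instead.
            return accepted + [predicted]
        accepted.append(tok)
    # The whole draft was accepted; the model still contributes one token.
    return accepted + [next_token_fn(context + accepted)]


# Toy stand-in for a greedy LLM: it always continues a fixed sentence.
TARGET = "a b c d e f g h".split()


def toy_next_token(prefix):
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else "<eos>"


context = ["a", "b", "c"]
partly_wrong = verify_copy(context, ["d", "e", "x", "y"], toy_next_token)
fully_right = verify_copy(context, ["d", "e", "f"], toy_next_token)
```

With the partly wrong draft, `d` and `e` are accepted, `x` and `y` are rejected, and the model's own `f` is appended. Because every kept token equals what greedy decoding would have produced, the output in this sketch is identical to decoding without copying; only the number of sequential model steps changes.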