RecycleGPT: An Autoregressive Language Model with Recyclable Module

Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu, Kunpeng Wang, Wenlai Zhao, Guangwen Yang

TL;DR

RecycleGPT tackles the latency bottleneck in autoregressive decoding by introducing a recyclable module that predicts multiple upcoming tokens from previous states, reducing full-model invocations. The model is trained with a dual objective that combines the standard autoregressive loss with a dedicated recycle loss, and is demonstrated on a 1.3B Transformer with a $6$-layer recyclable module, incurring only a $15\%$ parameter increase. Empirical results show comparable performance to strong baselines while achieving up to $1.4\times$ decoding speedup, especially under an alternating decoding strategy that interleaves recyclable-module predictions with full-model runs. The approach is orthogonal to existing acceleration methods and adaptable to different pre-trained models, offering practical benefits for low-latency generation in large language models. Overall, RecycleGPT provides a simple, scalable path to faster decoding without sacrificing accuracy, enabling wider applicability of autoregressive LLMs in latency-constrained environments.

Abstract

Existing large language models must run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model that decodes quickly by recycling pre-generated model states, so that the whole model does not have to run at every step. Our approach relies on the observation that adjacent tokens in a sequence are usually strongly correlated, so the next token can often be reasonably guessed or inferred from the preceding ones. Experiments and analysis demonstrate the effectiveness of our approach in lowering inference latency, achieving up to 1.4x speedup while preserving high performance.
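To make the alternating decoding strategy concrete, the following is a minimal sketch of the control flow only, not the authors' implementation. The names `alternating_decode`, `toy_full_model`, and `toy_recycle_module` are hypothetical, and the two toy functions are stand-ins that simply count upward; a real system would use a Transformer forward pass and a learned recyclable module that consumes the cached hidden states.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in types: a token is an int, a "state" is any object
# the full model produces and the recyclable module can reuse.
Token = int
State = int


def alternating_decode(
    prompt: List[Token],
    num_new_tokens: int,
    full_model: Callable[[List[Token]], Tuple[Token, State]],
    recycle_module: Callable[[State, Token], Token],
) -> Tuple[List[Token], int, int]:
    """Interleave one full-model call with one recyclable-module call.

    Returns the generated sequence plus the number of calls made to each
    component, so the saving in full-model invocations is visible.
    """
    tokens = list(prompt)
    target_len = len(prompt) + num_new_tokens
    full_calls = 0
    recycle_calls = 0

    while len(tokens) < target_len:
        # Expensive step: run the whole model and keep its state.
        next_token, state = full_model(tokens)
        full_calls += 1
        tokens.append(next_token)

        # Cheap step: guess the following token from the recycled state.
        if len(tokens) < target_len:
            guess = recycle_module(state, next_token)
            recycle_calls += 1
            tokens.append(guess)

    return tokens, full_calls, recycle_calls


def toy_full_model(tokens: List[Token]) -> Tuple[Token, State]:
    # Toy assumption: the "language" is counting modulo 100.
    return (tokens[-1] + 1) % 100, sum(tokens)


def toy_recycle_module(state: State, last_token: Token) -> Token:
    # Toy assumption: ignores the state and continues the count.
    return (last_token + 1) % 100


if __name__ == "__main__":
    out, n_full, n_recycle = alternating_decode(
        [0], 6, toy_full_model, toy_recycle_module
    )
    print(out, n_full, n_recycle)
```

In this toy setting, generating 6 tokens takes 3 full-model calls instead of 6. The actual wall-clock gain depends on how much cheaper the recyclable module is than the full model, which is why the reported speedup is below 2x.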

Paper Structure

This paper contains 18 sections, 7 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Model architecture of standard GPT and RecycleGPT.
  • Figure 2: Illustration of the difference between standard autoregressive decoding and autoregressive decoding using a recyclable module. The orange block indicates one forward call of the whole language model, while the green block indicates a call of the recyclable module. The green part requires far less computation and memory than the orange part, so under an alternating decoding strategy the recyclable module saves a significant amount of time. The yellow block indicates the final output classifier.
  • Figure 3: Training loss over training tokens.