RecycleGPT: An Autoregressive Language Model with Recyclable Module
Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu, Kunpeng Wang, Wenlai Zhao, Guangwen Yang
TL;DR
RecycleGPT tackles the latency bottleneck in autoregressive decoding by introducing a recyclable module that predicts multiple upcoming tokens from previously computed states, reducing the number of full-model invocations. The model is trained with a dual objective that combines the standard autoregressive loss with a dedicated recycle loss, and is demonstrated on a 1.3B Transformer with a $6$-layer recyclable module, which adds only $15\%$ more parameters. Empirical results show performance comparable to strong baselines with up to $1.4\times$ decoding speedup, especially under an alternating decoding strategy that interleaves recyclable-module predictions with full-model runs. The approach is orthogonal to existing acceleration methods and can be adapted to different pre-trained models. Overall, RecycleGPT offers a simple path to faster decoding with little loss in accuracy, which makes autoregressive LLMs more practical in latency-constrained settings.
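The alternating decoding strategy mentioned above can be illustrated with a minimal sketch. The interface is an assumption made for illustration and is not taken from the paper: `full_step` stands in for one full-model run that returns the next token together with some reusable state, and `recycle_step` stands in for the recyclable module that guesses the following token from that state. The two toy functions are placeholders, not real models.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for whatever model states the recyclable module reuses.
State = Tuple[int, ...]


def alternating_decode(
    prompt: Sequence[int],
    full_step: Callable[[List[int]], Tuple[int, State]],
    recycle_step: Callable[[State, int], int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Interleave full-model steps with recyclable-module steps.

    Returns the generated token list and the number of full-model calls.
    """
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    full_calls = 0
    while len(tokens) < target_len:
        # Expensive step: run the whole model once.
        next_token, state = full_step(tokens)
        full_calls += 1
        tokens.append(next_token)
        # Cheap step: reuse the state just computed to guess one more token.
        if len(tokens) < target_len:
            tokens.append(recycle_step(state, next_token))
    return tokens, full_calls


def toy_full_step(tokens: List[int]) -> Tuple[int, State]:
    """Toy 'full model': counts upward modulo 10 and exposes a dummy state."""
    next_token = (tokens[-1] + 1) % 10
    return next_token, (next_token,)


def toy_recycle_step(state: State, last_token: int) -> int:
    """Toy 'recyclable module': also counts upward, ignoring the dummy state."""
    return (last_token + 1) % 10


if __name__ == "__main__":
    out, calls = alternating_decode([3], toy_full_step, toy_recycle_step, 5)
    print(out, calls)
```

In this sketch, generating K new tokens takes about K/2 full-model calls instead of K, which is where the latency saving would come from. Any quality cost depends on how well the cheap step approximates the full model.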
Abstract
Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model that decodes faster by recycling previously generated model states, so that the whole model does not have to be run at every step. Our approach relies on the observation that adjacent tokens in a sequence usually have strong correlations, so the next token can often be reasonably guessed or inferred from the preceding ones. Experiments and analysis demonstrate the effectiveness of our approach in lowering inference latency, achieving up to 1.4x speedup while preserving high performance.
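To give a rough feel for where a speedup of this size could come from, here is a back-of-the-envelope estimate under a toy cost model. The model is an assumption, not a result from the paper: every second token comes from the full model at cost 1, the others come from a cheaper module at relative cost `recycle_cost_ratio` (a hypothetical parameter), and all other overheads are ignored.

```python
def estimated_speedup(recycle_cost_ratio: float) -> float:
    """Speedup over running the full model for every token.

    Assumes strict alternation: two tokens cost (1 + r) instead of 2,
    where r = recycle_cost_ratio is the cheap step's cost relative to
    one full-model run. This is an illustrative toy model only.
    """
    if not 0.0 <= recycle_cost_ratio <= 1.0:
        raise ValueError("recycle_cost_ratio must lie in [0, 1]")
    return 2.0 / (1.0 + recycle_cost_ratio)


if __name__ == "__main__":
    for r in (0.0, 0.25, 3.0 / 7.0, 1.0):
        print(f"r = {r:.3f} -> estimated speedup {estimated_speedup(r):.2f}x")
```

Under this toy model the speedup is bounded by 2x, and a 1.4x speedup would correspond to a cheap step costing roughly 0.43 of a full-model run. Measured speedups depend on hardware, batching, and implementation details that the sketch leaves out.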
