Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding

Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Zhewei Wei, Weinan Zhang, Yong Yu

TL;DR

LASER tackles the latency challenge of deploying LLMs in large-scale recommender systems by introducing retrieval-based speculative decoding with two key enhancements: Customized Retrieval Pool and Relaxed Verification. The method builds compact, group-specific retrieval pools and uses tree-based drafting with parallel verification to generate knowledge more efficiently while preserving downstream performance. Across public datasets and multiple RS frameworks, LASER delivers 3-5x speedups and substantial computational savings, with negligible impact on recommendation quality in offline and online deployments. The work demonstrates a practical path to scalable LLM-enabled RSs and suggests broader applicability to knowledge generation in information retrieval settings.

Abstract

The past few years have witnessed a growing interest in LLM-based recommender systems (RSs), although their industrial deployment remains in a preliminary stage. Most existing deployments leverage LLMs offline as feature enhancers, generating augmented knowledge for downstream tasks. However, in recommendation scenarios with numerous users and items, even offline knowledge generation with LLMs demands significant time and computational resources. This inefficiency arises from the autoregressive nature of LLMs. A promising solution is speculative decoding, a Draft-Then-Verify approach that increases the number of tokens generated per decoding step. In this work, we first identify recommendation knowledge generation as a highly fitting use case for retrieval-based speculative decoding. Then, we discern its two characteristics: (1) the vast number of items and users in RSs leads to retrieval inefficiency, and (2) RSs exhibit high diversity tolerance for LLM-generated text. Building on these insights, we introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER), which features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens. LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test on a large-scale advertising scenario with lossless downstream recommendation performance. Our code is available at https://github.com/YunjiaXi/LASER
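
To make the Draft-Then-Verify idea concrete, here is a minimal Python sketch of one retrieval-based speculative decoding step. It is an illustration under stated assumptions, not the LASER implementation: `toy_rank` is a hypothetical stand-in for the target LLM, the retrieval pool is a single hand-written token sequence, and the relaxed acceptance rule (accept a draft token if it is among the target's top-k candidates) is an assumed example of a relaxed check, since the abstract does not spell out the rule. A real system would verify all draft positions in one parallel forward pass; the loop below is sequential only for readability.

```python
from typing import Callable, List, Sequence

Token = str
# Hypothetical stand-in for an LLM: maps a prefix to next-token candidates, best first.
RankFn = Callable[[Sequence[Token]], List[Token]]


def retrieve_draft(context: Sequence[Token], pool: List[List[Token]],
                   match_len: int = 2, max_draft: int = 4) -> List[Token]:
    """Draft: find the context suffix in the pool and return what followed it."""
    suffix = list(context[-match_len:])
    for doc in pool:
        for i in range(len(doc) - match_len + 1):
            if doc[i:i + match_len] == suffix:
                return doc[i + match_len:i + match_len + max_draft]
    return []  # no match: fall back to plain autoregressive decoding


def verify(context: Sequence[Token], draft: List[Token],
           rank_fn: RankFn, top_k: int = 1) -> List[Token]:
    """Verify: keep the longest draft prefix whose tokens fall in the target's top-k.

    top_k=1 is strict (exact match); top_k>1 is the assumed relaxed variant.
    """
    accepted: List[Token] = []
    for tok in draft:
        candidates = rank_fn(list(context) + accepted)[:top_k]
        if tok not in candidates:
            break
        accepted.append(tok)
    return accepted


def decode_step(context: Sequence[Token], pool: List[List[Token]],
                rank_fn: RankFn, top_k: int = 1) -> List[Token]:
    """One decoding step: accepted draft tokens plus one token from the target."""
    accepted = verify(context, retrieve_draft(context, pool), rank_fn, top_k)
    next_token = rank_fn(list(context) + accepted)[0]
    return accepted + [next_token]


def toy_rank(prefix: Sequence[Token]) -> List[Token]:
    """Toy target model: candidates depend only on the last token (assumption)."""
    table = {
        "likes": ["sci-fi", "action"],
        "sci-fi": ["movies", "films"],
        "movies": ["and", "."],
        "films": ["and", "."],
        "and": ["thrillers", "dramas"],
    }
    return table.get(prefix[-1], ["."])


pool = [["user", "likes", "sci-fi", "films", "and", "dramas"]]
context = ["the", "user", "likes"]

strict = decode_step(context, pool, toy_rank, top_k=1)
relaxed = decode_step(context, pool, toy_rank, top_k=2)
print(strict)   # ['sci-fi', 'movies']
print(relaxed)  # ['sci-fi', 'films', 'and', 'dramas', '.']
```

In this toy setting the strict check emits two tokens in the step, while the relaxed check emits five, which mirrors how a higher acceptance rate raises the number of tokens produced per decoding step.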

Paper Structure (30 sections, 8 equations, 6 figures, 6 tables)

This paper contains 30 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Pipeline of retrieval-based speculative decoding for RSs and speedup of autoregressive decoding (Vanilla), naive retrieval-based speculative decoding (ReSD), and LASER.
  • Figure 2: The impact of retrieval pool size.
  • Figure 3: Comparison between naive retrieval-based speculative decoding (ReSD, above) and our LASER (below). Here, we take users as examples; the same process applies to items. Note that the retrieved tree-structured draft is converted into a pseudo-sequence for parallel validation, which is detailed in the section on tree-based drafts.
  • Figure 4: Tree attention.
  • Figure 5: Comparison with speculative decoding methods.
  • ...and 1 more figure