Reinforcement Speculative Decoding for Fast Ranking

Yingpeng Du; Tianjun Wei; Zhu Sun; Jie Zhang

TL;DR

This work addresses the latency and tail-position degradation of LLM-based ranking in information retrieval (IR) and recommender systems (RSs) by introducing Reinforcement Speculative Decoding (RSD), an up-to-down decoding paradigm that iteratively refines a ranking within a fixed budget $T$. It combines a listwise-aware relevance network with reinforcement learning to learn a ranking-modification policy, using a Spearman-distance-based return and a reference-model baseline to keep policy updates unbiased and low in variance. The method avoids fine-tuning the target LLM, reduces decoding cost relative to full autoregressive decoding, and improves over single-token decoding (STD) and speculative decoding (SD) baselines on MS MARCO, Quora, ML-1M, and Amazon-Games; ablations support the importance of the ranking-tailored policy optimization (RPO) and of listwise modeling. The results suggest practical value for real-world ranking systems, though challenges remain in cross-domain generalization and in scaling to very large candidate sets.
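To make the "Spearman-distance-based return" concrete, here is a minimal sketch of how the disagreement between two rankings can be measured. The summary does not spell out the exact formula, so this assumes the common definition (the sum of squared position differences); the function names `spearman_distance` and `spearman_rho` are illustrative and not taken from the paper.

```python
from typing import Hashable, Sequence


def spearman_distance(ranking_a: Sequence[Hashable], ranking_b: Sequence[Hashable]) -> int:
    """Sum of squared position differences between two rankings of the same items.

    Assumption: this is the standard Spearman distance; the paper's return
    may normalize or weight it differently.
    """
    if len(set(ranking_a)) != len(ranking_a) or len(set(ranking_b)) != len(ranking_b):
        raise ValueError("rankings must not contain duplicate items")
    if set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")
    # Position of every item in the second ranking, for O(1) lookup.
    position_in_b = {item: pos for pos, item in enumerate(ranking_b)}
    return sum((pos_a - position_in_b[item]) ** 2 for pos_a, item in enumerate(ranking_a))


def spearman_rho(ranking_a: Sequence[Hashable], ranking_b: Sequence[Hashable]) -> float:
    """Rescale the distance to [-1, 1] via rho = 1 - 6 d / (n (n^2 - 1))."""
    n = len(ranking_a)
    if n < 2:
        return 1.0  # a ranking of zero or one item trivially agrees with itself
    d = spearman_distance(ranking_a, ranking_b)
    return 1.0 - 6.0 * d / (n * (n * n - 1))
```

A smaller distance means the approximate ranking is closer to the target LLM's ranking, so a return built on it would plausibly be the negative distance or the rescaled correlation; which variant the authors use is not stated here.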

Abstract

Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore single (first) token decoding as a ranking approximation, but these approaches suffer from severe degradation at tail positions. Although speculative decoding (SD) methods can remedy this by verifying tokens at different positions, their left-to-right decoding paradigm poses challenges in ranking systems. First, ranking systems impose strict latency constraints, but SD methods are agnostic to the number of verification rounds they need. Second, SD methods usually discard listwise ranking knowledge about items left unaccepted in previous rounds, which hinders later multi-token prediction, especially when the candidate tokens are those unaccepted items. In this paper, we propose a Reinforcement Speculative Decoding method for fast ranking inference of LLMs. To meet the latency requirements of ranking systems, we propose an up-to-down decoding paradigm that employs an agent to iteratively modify the ranking sequence under a constrained budget. Specifically, we design a ranking-tailored policy optimization that uses reinforcement learning (RL) to actively explore the optimal multi-round ranking-modification policy, with modifications verified by the LLM. To better approximate the target LLM under the constrained budget, we have the agent fully utilize the listwise ranking knowledge about all items verified by the LLM across different rounds in RL, which enhances the agent's modification policy. More importantly, we demonstrate the theoretical robustness and advantages of our paradigm and implementation. Experiments on both IR and RS tasks show the effectiveness of the proposed method.
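The control flow implied by "iteratively modify the ranking sequence under a constrained budget" can be sketched as a loop that never calls the target model more than a fixed number of times. Everything below is a toy illustration under stated assumptions: the feedback format, `refine_with_budget`, `make_toy_verifier`, and `toy_modify` are hypothetical names, and the hand-written modification rule stands in for the learned RL policy, which the abstract does not specify.

```python
from typing import Callable, List, Optional, Tuple

Ranking = List[str]
# Hypothetical feedback format (an assumption, not from the paper):
# (length of the accepted prefix, target's item at the first mismatch or None).
Feedback = Tuple[int, Optional[str]]


def refine_with_budget(
    draft: Ranking,
    verify: Callable[[Ranking], Feedback],
    modify: Callable[[Ranking, List[Feedback]], Ranking],
    budget: int,
) -> Tuple[Ranking, int]:
    """Refine a whole ranking for at most `budget` verification rounds."""
    ranking = list(draft)
    history: List[Feedback] = []  # feedback from every round stays available
    rounds_used = 0
    for _ in range(budget):
        feedback = verify(ranking)  # one call to the (expensive) target per round
        rounds_used += 1
        history.append(feedback)
        if feedback[1] is None:  # the target accepted the full ranking
            break
        ranking = modify(ranking, history)
    return ranking, rounds_used


def make_toy_verifier(target: Ranking) -> Callable[[Ranking], Feedback]:
    """Stand-in for the target LLM: compares against a known target ranking."""
    def verify(ranking: Ranking) -> Feedback:
        for pos, (got, want) in enumerate(zip(ranking, target)):
            if got != want:
                return pos, want
        return len(ranking), None
    return verify


def toy_modify(ranking: Ranking, history: List[Feedback]) -> Ranking:
    """Placeholder rule: move the corrected item to the first mismatched position."""
    pos, item = history[-1]
    rest = [x for x in ranking if x != item]
    return rest[:pos] + [item] + rest[pos:]
```

For example, with `verify = make_toy_verifier(["a", "b", "c", "d"])`, calling `refine_with_budget(["b", "a", "d", "c"], verify, toy_modify, budget=2)` stops after exactly two rounds regardless of how good the result is. The point of the design is that latency is bounded by `budget` up front, whereas the quality of the result depends on how well the modification policy uses the accumulated feedback.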
