Streaming Looking Ahead with Token-level Self-reward

Hongming Zhang; Ruixin Hong; Dong Yu


TL;DR

The paper addresses the inefficiency of lookahead decoding that relies on external reward models by introducing the Reward Transformer, a policy model that performs token-level self-reward modeling (TRM) itself. It also proposes a streaming-looking-ahead (SLA) algorithm, which enables fine-grained, token-level lookahead with near-zero additional communication overhead and so makes lookahead feasible in streaming scenarios. On three general-domain datasets, SLA achieves an overall win rate of $79.7\%$ against greedy decoding with a frozen policy model, and $89.4\%$ when combined with reinforcement fine-tuning techniques such as DPO. The work also introduces the AuTRC metric to assess TRM quality and reports that distributing reward capacity across transformer layers yields better TRM performance than adapter-based approaches. Overall, the approach reduces reliance on external reward models and improves generation quality while preserving streaming efficiency.

Abstract

Autoregressive decoding algorithms that use only past information often cannot guarantee the best performance. Recent work has found that looking-ahead algorithms such as Monte Carlo Tree Search (MCTS) with external reward models (RMs) can significantly improve models' output by allowing them to think ahead and leverage future outputs and associated rewards to guide the current generation. Such techniques can help the reinforcement fine-tuning phase by sampling better trajectories and the inference phase by selecting the better output. However, their high computational cost limits their application, especially in streaming scenarios. To address this issue, we propose equipping the policy model with token-level self-reward modeling (TRM) capability to eliminate the need for external models and extra communication. We name the new architecture the Reward Transformer. In addition, we propose a streaming-looking-ahead (SLA) algorithm to further boost search efficiency with better parallelization. Experiments show that SLA achieves an overall win rate of $79.7\%$ against the baseline greedy decoding algorithm on three general-domain datasets with a frozen policy model while maintaining streaming efficiency. When SLA is combined with reinforcement fine-tuning techniques such as DPO, it achieves an overall win rate of $89.4\%$.
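To make the idea of looking ahead with a self-reward concrete, the following is a minimal toy sketch, not the paper's SLA algorithm. It assumes a hypothetical `step` function in which a single forward pass returns both next-token probabilities and a scalar reward for the current prefix, standing in for a policy with built-in token-level reward modeling. The lookup table, token names, and the parameters `k` and `depth` are invented for illustration, and the parallelization that makes SLA efficient in streaming is not modeled.

```python
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"

# Hypothetical interface: one call returns (next-token probabilities,
# self-reward of the given prefix). This mimics a policy that scores
# its own prefixes, so no external reward model is queried.
StepFn = Callable[[List[str]], Tuple[Dict[str, float], float]]

# Toy lookup table (invented values): prefix -> (probs, reward).
TABLE = {
    (): ({"a": 0.6, "b": 0.4}, 0.0),
    ("a",): ({"x": 1.0}, 0.1),
    ("b",): ({"y": 1.0}, 0.2),
    ("a", "x"): ({EOS: 1.0}, 0.2),
    ("b", "y"): ({EOS: 1.0}, 0.9),
}


def toy_step(prefix: List[str]) -> Tuple[Dict[str, float], float]:
    # Unknown prefixes default to "emit EOS" with zero reward.
    return TABLE.get(tuple(prefix), ({EOS: 1.0}, 0.0))


def greedy_decode(step: StepFn, max_len: int = 10) -> List[str]:
    """Baseline: always take the most probable next token."""
    seq: List[str] = []
    for _ in range(max_len):
        probs, _ = step(seq)
        seq.append(max(probs, key=probs.get))
        if seq[-1] == EOS:
            break
    return seq


def lookahead_decode(step: StepFn, k: int = 2, depth: int = 1,
                     max_len: int = 10) -> List[str]:
    """For each of the top-k candidates, roll out `depth` greedy tokens
    and keep the candidate whose rollout gets the highest self-reward."""
    seq: List[str] = []
    for _ in range(max_len):
        probs, _ = step(seq)
        candidates = sorted(probs, key=probs.get, reverse=True)[:k]
        best_tok, best_score = candidates[0], float("-inf")
        for tok in candidates:
            rollout = seq + [tok]
            for _ in range(depth):
                if rollout[-1] == EOS:
                    break
                r_probs, _ = step(rollout)
                rollout.append(max(r_probs, key=r_probs.get))
            _, score = step(rollout)  # self-reward at the rollout's end
            if score > best_score:
                best_tok, best_score = tok, score
        seq.append(best_tok)
        if best_tok == EOS:
            break
    return seq


greedy_out = greedy_decode(toy_step)
lookahead_out = lookahead_decode(toy_step, k=2, depth=1)
```

In this toy setting, greedy decoding commits to the locally most probable first token, while the lookahead variant picks the first token whose short rollout scores higher under the model's own reward. Only the committed token is emitted at each step, which is what keeps the output streamable.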
