Optimizing Generative Ranking Relevance via Reinforcement Learning in Xiaohongshu Search

Ziyang Zeng, Heming Jing, Jindong Chen, Xiangli Li, Hongyu Liu, Yixuan He, Zhengyu Li, Yige Sun, Zheyong Xie, Yuqing Yang, Shaosheng Cao, Jun Fan, Yi Wu, Yao Hu

TL;DR

The paper tackles open-domain search relevance by replacing traditional scalar scoring with grounded, multi-step reasoning trained with reinforcement learning. It introduces criteria-augmented prompts and Stepwise Advantage Masking (SAM) for step-level credit assignment, supported by a distillation-based warm-up and group-relative RL objectives. On the RANDOM and LONGTAIL benchmarks, ProcessRL-Reasoning outperforms SFT and other RL baselines and shows notable data efficiency, and a distilled lightweight student model performs strongly in online deployment. The paper reports improved user engagement and ranking quality while maintaining production efficiency, and outlines future directions on dynamic criteria and LLM-based verifiers.

Abstract

Ranking relevance is a fundamental task in search engines, aiming to identify the items most relevant to a given user query. Traditional relevance models typically produce scalar scores or directly predict relevance labels, limiting both interpretability and the modeling of complex relevance signals. Inspired by recent advances in Chain-of-Thought (CoT) reasoning for complex tasks, we investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. However, existing reasoning-based Generative Relevance Models (GRMs) primarily rely on supervised fine-tuning on large amounts of human-annotated or synthetic CoT data, which often leads to limited generalization. Moreover, domain-agnostic, free-form reasoning tends to be overly generic and insufficiently grounded, limiting its potential to handle the diverse and ambiguous cases prevalent in open-domain search. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task and introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs. Specifically, we incorporate practical business-specific relevance criteria into the multi-step reasoning prompt design and propose Stepwise Advantage Masking (SAM), a lightweight process-supervision strategy which facilitates effective learning of these criteria through improved credit assignment. To enable industrial deployment, we further distill the large-scale RL-tuned model to a lightweight version suitable for real-world search systems. Extensive experiments on industrial datasets, along with online A/B tests, demonstrate the effectiveness of our approach.

Paper Structure

This paper contains 34 sections, 10 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example illustrating that explicit reasoning enhances both interpretability and effectiveness of relevance assessment. For the query "Why do plants need light to grow?", a reasoning-based model leverages photosynthesis-related content to recognize partial relevance, whereas a model without reasoning fails to identify this connection.
  • Figure 2: Illustration of the Stepwise Advantage Masking (SAM) strategy for process-supervised reinforcement learning. In the three-step relevance reasoning, the model produces intermediate scores (denoted by $\boxed{}$) at each step, which are validated against the ground-truth label using a rule-based verifier (i.e., exact matching). We define correctness indicators $(c_1, c_2, c_3)$, where $c_i = \texttt{True}$ if step $i$ yields a correct intermediate prediction. SAM leverages these indicators to construct a stepwise advantage mask: if the final answer is correct, only correct steps are reinforced; if the final answer is incorrect, only the erroneous steps are penalized. This selective credit assignment prevents spurious reward propagation and facilitates efficient step-level optimization of generative relevance models.
  • Figure 3: Prototype of the criteria-augmented prompt.
  • Figure 4: Analysis of data efficiency on SFT.
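
The masking rule described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `sam_step_mask` and `masked_token_advantages`, the idea of passing a single scalar advantage per response, and the per-step token counts are all assumptions made for illustration. Final-answer correctness is passed as a separate flag because the caption does not say how it is derived from the step indicators.

```python
from typing import List


def sam_step_mask(step_correct: List[bool], final_correct: bool) -> List[int]:
    """Build a per-step mask following the rule in the Figure 2 caption.

    1 means the step's tokens keep the advantage; 0 means they are masked out.
    (Hypothetical helper; the name and signature are illustrative.)
    """
    if final_correct:
        # Correct final answer: reinforce only the steps that were correct.
        return [1 if c else 0 for c in step_correct]
    # Incorrect final answer: penalize only the steps that were wrong.
    return [0 if c else 1 for c in step_correct]


def masked_token_advantages(
    advantage: float,
    step_correct: List[bool],
    final_correct: bool,
    step_lengths: List[int],
) -> List[float]:
    """Broadcast one response-level advantage to tokens, zeroing masked steps.

    `step_lengths` (an assumed input) gives the number of tokens in each step.
    """
    mask = sam_step_mask(step_correct, final_correct)
    token_advantages: List[float] = []
    for keep, n_tokens in zip(mask, step_lengths):
        token_advantages.extend([advantage * keep] * n_tokens)
    return token_advantages


# Correct final answer with a wrong second step: step 2 gets no credit.
print(sam_step_mask([True, False, True], final_correct=True))
# Incorrect final answer: only the wrong steps (2 and 3) are penalized.
print(sam_step_mask([True, False, False], final_correct=False))
```

In this sketch the mask multiplies a response-level advantage, so masked steps contribute no gradient signal. How the advantage itself is computed (for example, by a group-relative objective) is outside the scope of the example.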