Optimizing Generative Ranking Relevance via Reinforcement Learning in Xiaohongshu Search

Ziyang Zeng; Heming Jing; Jindong Chen; Xiangli Li; Hongyu Liu; Yixuan He; Zhengyu Li; Yige Sun; Zheyong Xie; Yuqing Yang; Shaosheng Cao; Jun Fan; Yi Wu; Yao Hu

TL;DR

The paper tackles open-domain search relevance by moving from traditional scalar scoring to grounded, multi-step reasoning trained with reinforcement learning. It introduces criteria-augmented prompts and Stepwise Advantage Masking (SAM) for step-level credit assignment, supported by a distillation-based warm-up and group-relative RL objectives. On the RANDOM and LONGTAIL benchmarks, the proposed ProcessRL-Reasoning model outperforms SFT and other RL baselines and is notably data-efficient. For online deployment, the RL-tuned model is distilled into a lightweight student model, which improves user engagement and ranking quality while maintaining production efficiency. The paper also outlines future directions: dynamic criteria and LLM-based verifiers.

Abstract

Ranking relevance is a fundamental task in search engines, aiming to identify the items most relevant to a given user query. Traditional relevance models typically produce scalar scores or directly predict relevance labels, limiting both interpretability and the modeling of complex relevance signals. Inspired by recent advances in Chain-of-Thought (CoT) reasoning for complex tasks, we investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. However, existing reasoning-based Generative Relevance Models (GRMs) primarily rely on supervised fine-tuning on large amounts of human-annotated or synthetic CoT data, which often leads to limited generalization. Moreover, domain-agnostic, free-form reasoning tends to be overly generic and insufficiently grounded, limiting its potential to handle the diverse and ambiguous cases prevalent in open-domain search. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task and introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs. Specifically, we incorporate practical business-specific relevance criteria into the multi-step reasoning prompt design and propose Stepwise Advantage Masking (SAM), a lightweight process-supervision strategy which facilitates effective learning of these criteria through improved credit assignment. To enable industrial deployment, we further distill the large-scale RL-tuned model to a lightweight version suitable for real-world search systems. Extensive experiments on industrial datasets, along with online A/B tests, demonstrate the effectiveness of our approach.
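To make the idea of step-level credit assignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes a GRPO-style group-relative advantage (reward normalized within a group of sampled responses) and a hypothetical masking rule: a response with positive advantage reinforces only the reasoning steps judged correct, and a response with negative advantage penalizes only the steps judged incorrect. The function names, the `step_correct` flags, and the masking rule itself are assumptions made for illustration; the text above does not specify them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its sampling group.

    Assumption: GRPO-style normalization (subtract group mean,
    divide by group standard deviation).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def stepwise_masked_advantages(advantage, step_correct, step_lengths):
    """Spread one response-level advantage over tokens, masking some steps.

    Hypothetical rule (an assumption, not taken from the paper):
      - advantage > 0: keep it only on steps flagged correct
      - advantage <= 0: keep it only on steps flagged incorrect
    Masked steps receive a zero advantage, so they add no gradient signal.
    """
    per_token = []
    for correct, n_tokens in zip(step_correct, step_lengths):
        keep = correct if advantage > 0 else not correct
        per_token.extend([advantage if keep else 0.0] * n_tokens)
    return per_token


# Four sampled responses for one query-note pair; reward 1.0 = correct label.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A three-step response (token counts 2, 1, 2) whose middle step was wrong.
masked = stepwise_masked_advantages(advantages[0], [True, False, True], [2, 1, 2])
```

Under this assumed rule, a response that reaches the right label through a flawed intermediate step is not rewarded for that step, which is one plausible reading of "improved credit assignment".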
