Aligning Large Language Models with Searcher Preferences

Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

TL;DR

This work introduces SearchLLM, the first large language model (LLM) for open-ended generative search, together with a Gated Aggregation Strategy that derives the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO).

Abstract

The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
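The abstract does not give the exact form of the Gated Aggregation Strategy, so the following is only a minimal sketch of one plausible reading: bottom-line constraints act as a gate, and behavior objectives are rewarded only when every constraint passes. The dimension names follow the abstract, but the threshold, the failure reward, the unweighted mean, and all function names are hypothetical choices made for illustration. The group-relative advantage is the usual GRPO-style standardization of rewards within one sampled group.

```python
from statistics import mean, pstdev

# Dimension names follow the abstract; thresholds and weighting are hypothetical.
BOTTOM_LINE = ("factual_grounding", "answer_quality", "format_compliance")
BEHAVIOR = ("noise_robustness", "user_need_alignment")


def gated_reward(scores, threshold=0.5, fail_reward=0.0):
    """Assumed gate: any bottom-line score below threshold yields fail_reward;
    otherwise the reward is the unweighted mean of the behavior scores."""
    if any(scores[d] < threshold for d in BOTTOM_LINE):
        return fail_reward
    return mean(scores[d] for d in BEHAVIOR)


def linear_reward(scores):
    """Contrast case: plain average over all dimensions, with no gate."""
    return mean(scores[d] for d in BOTTOM_LINE + BEHAVIOR)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group.
    Population standard deviation is an implementation choice here."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical responses to the same query (one sampled group).
group = [
    {"factual_grounding": 0.9, "answer_quality": 0.8, "format_compliance": 1.0,
     "noise_robustness": 0.6, "user_need_alignment": 0.8},
    {"factual_grounding": 0.2, "answer_quality": 0.9, "format_compliance": 1.0,
     "noise_robustness": 1.0, "user_need_alignment": 1.0},
]
rewards = [gated_reward(s) for s in group]
linear = [linear_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the linear average scores both responses the same, because the second response's strong behavior scores offset its poor factual grounding. The gated reward instead zeroes out the ungrounded response, so only the grounded one receives a positive advantage.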

Paper Structure

This paper contains 37 sections, 9 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: User interaction snapshots of open-ended generative search in RedNote. The bottom-right panel summarizes failure attribution from online user feedback.
  • Figure 2: Overview of the alignment framework for open-ended generative search. The pipeline incorporates a multi-dimensional reward system that explicitly decouples non-negotiable bottom-line constraints (Layer I) from behavioral optimization objectives (Layer II). A hybrid evaluation stack, consisting of deterministic rules and human-calibrated LLM judges, computes fine-grained scores across multiple dimensions. These signals are synthesized via a gated aggregation mechanism to stabilize the learning signal for Group Relative Policy Optimization (GRPO).
  • Figure 3: Comparison of generation quality between our policy and multiple baselines, as evaluated by human experts.
  • Figure 4: Training dynamics under different reward aggregation strategies. The curves illustrate the evolution of scores across distinct reward dimensions during training, comparing the Gated Aggregation strategy against the Linear baseline.
  • Figure 5: Results of the online A/B test on the RedNote platform conducted in 2026. The chart displays the relative changes in key user engagement metrics for our deployed model compared to the production baseline (SFT).
  • ...and 5 more figures