Aligning Large Language Models with Searcher Preferences

Wei Wu; Peilun Zhou; Liyi Chen; Qimeng Wang; Chengqiang Lu; Yan Gao; Yi Wu; Yao Hu; Hui Xiong

TL;DR

This work introduces SearchLLM, the first large language model (LLM) for open-ended generative search, and proposes a Gated Aggregation Strategy that derives the training reward used to optimize SearchLLM with Group Relative Policy Optimization (GRPO).
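The TL;DR names GRPO as the optimizer. The following is a minimal sketch of the group-relative step that gives GRPO its name. It assumes the standard formulation, in which several responses are sampled for the same query and each scalar reward is normalized by the group mean and standard deviation. The function name `group_relative_advantages`, the `eps` guard, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for responses sampled from one query.

    Each advantage is (reward - group mean) / (group std + eps); eps keeps
    the division finite when every response in the group scored the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four responses sampled for one query.
advantages = group_relative_advantages([0.9, 0.4, 0.0, 0.7])
print([round(a, 3) for a in advantages])
```

Because advantages are centered within each group, a response is reinforced only when it scores better than its siblings for the same query. How the multi-dimensional scores are collapsed into that single scalar reward therefore directly shapes the learning signal.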

Abstract

The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
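The abstract separates Layer I bottom-line constraints from Layer II behavioral objectives but does not state the gating rule itself. The sketch below shows one hypothetical way to realize such a gate and contrasts it with a linear weighted sum, such as the Linear baseline compared in Figure 4. Under this gate, any failed constraint sends the reward to a fixed penalty, and only fully compliant responses are ranked by their behavioral scores. The dimension names, the [0, 1] score scale, the pass threshold, and the penalty value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ScoreVector:
    constraints: Dict[str, float]  # Layer I scores (hypothetical names), assumed in [0, 1]
    behaviors: Dict[str, float]    # Layer II scores (hypothetical names), assumed in [0, 1]


def linear_reward(s: ScoreVector, weights: Dict[str, float]) -> float:
    """Baseline: one weighted sum over every dimension, constraints included."""
    dims = {**s.constraints, **s.behaviors}
    return sum(weights[k] * v for k, v in dims.items())


def gated_reward(s: ScoreVector, behavior_weights: Dict[str, float],
                 threshold: float = 1.0, penalty: float = -1.0) -> float:
    """Hypothetical gate: any constraint below threshold yields a fixed penalty;
    otherwise the reward is the weighted mean of the Layer II scores."""
    if any(v < threshold for v in s.constraints.values()):
        return penalty
    total = sum(behavior_weights.values())
    return sum(behavior_weights[k] * v for k, v in s.behaviors.items()) / total


# Toy comparison: a grounded but plain answer vs. an ungrounded but polished one.
grounded = ScoreVector({"grounding": 1.0, "quality": 1.0, "format": 1.0},
                       {"robustness": 0.5, "user_need": 0.5})
ungrounded = ScoreVector({"grounding": 0.0, "quality": 1.0, "format": 1.0},
                         {"robustness": 1.0, "user_need": 1.0})
weights = {k: 1.0 for k in ("grounding", "quality", "format", "robustness", "user_need")}
behavior_weights = {"robustness": 1.0, "user_need": 1.0}

print(linear_reward(grounded, weights), linear_reward(ungrounded, weights))
print(gated_reward(grounded, behavior_weights), gated_reward(ungrounded, behavior_weights))
```

In this toy setup, the linear sum gives the ungrounded but polished answer the same reward as the grounded, plainer one. The gate ranks the grounded answer strictly higher. This illustrates how decoupling constraints from objectives can stop a policy from trading factual grounding for style, although the paper's actual gating function may differ.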

Paper Structure

This paper contains 37 sections, 9 equations, 10 figures, and 5 tables.

Introduction
Related Work
Large Language Models for Search
Alignment of Large Language Models
Methodology
System Overview
Multi-Dimensional Reward System
Reward Design
Layer I: Bottom-line Constraints.
Layer II: Behavioral Objectives.
Implementation of Hybrid Evaluation Stack
Reinforcement Learning Framework
Experiments
Experimental Setup
Datasets
...and 22 more sections

Figures (10)

Figure 1: User interaction snapshots of open-ended generative search in RedNote. The bottom-right panel summarizes failure attribution from online user feedback.
Figure 2: Overview of the alignment framework for open-ended generative search. The pipeline incorporates a multi-dimensional reward system that explicitly decouples non-negotiable bottom-line constraints (Layer I) from behavioral optimization objectives (Layer II). A hybrid evaluation stack, consisting of deterministic rules and human-calibrated LLM judges, computes fine-grained scores across multiple dimensions. These signals are synthesized via a gated aggregation mechanism to stabilize the learning signal for Group Relative Policy Optimization (GRPO).
Figure 3: Comparison on generation quality of our policy against multiple baselines evaluated by human experts.
Figure 4: Training dynamics under different reward aggregation strategies. The curves illustrate the evolution of scores across distinct reward dimensions during training, comparing the Gated Aggregation strategy against the Linear baseline.
Figure 5: Results of the online A/B test on the RedNote platform conducted in 2026. The chart displays the relative changes in key user engagement metrics for our deployed model compared to the production baseline (SFT).
...and 5 more figures
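The Figure 2 caption describes a hybrid evaluation stack in which deterministic rules and human-calibrated LLM judges each contribute dimensions to a score vector. According to the abstract, that vector is conditioned on the user query, session history, and retrieved evidence. Below is a minimal sketch of how such a stack could be composed. The two rules are hypothetical: a length budget, and a check that bracketed citation markers such as `[2]` point at a retrieved document. The judge interface is also hypothetical, and the judge used here is a placeholder lambda standing in for a real LLM call.

```python
import re
from typing import Callable, Dict, List

# Hypothetical judge interface: (query, history, response, evidence) -> score in [0, 1].
Judge = Callable[[str, List[str], str, List[str]], float]


def format_check(response: str, max_chars: int = 2000) -> float:
    """Deterministic rule (assumed): response is non-empty and within a length budget."""
    return 1.0 if 0 < len(response.strip()) <= max_chars else 0.0


def citation_check(response: str, evidence: List[str]) -> float:
    """Deterministic rule (assumed): every [k] marker must index a retrieved document."""
    cited = [int(k) for k in re.findall(r"\[(\d+)\]", response)]
    return 1.0 if all(1 <= k <= len(evidence) for k in cited) else 0.0


def evaluate(query: str, history: List[str], response: str,
             evidence: List[str], judges: Dict[str, Judge]) -> Dict[str, float]:
    """Merge rule-based checks and judge scores into one score vector."""
    scores = {
        "format": format_check(response),
        "citations": citation_check(response, evidence),
    }
    for name, judge in judges.items():
        scores[name] = judge(query, history, response, evidence)
    return scores


# Usage with a placeholder judge that always returns 0.8.
evidence = ["Note on beginner hiking trails.", "Note on trail safety."]
judges = {"user_need": lambda q, h, r, e: 0.8}
scores = evaluate("beginner hiking trails", [],
                  "Try the river loop [1] and carry water [2].", evidence, judges)
print(scores)
```

Keeping each dimension as a separate named entry, instead of summing early, preserves the interpretable score vector the abstract describes. The aggregation step can then decide separately how to turn that vector into a training reward.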
