RLRF: Competitive Search Agent Design via Reinforcement Learning from Ranker Feedback
Tommy Mordo, Sagie Dekel, Omer Madmon, Moshe Tennenholtz, Oren Kurland
TL;DR
This paper addresses competitive search, where publishers (LLM-based agents) modify documents to improve their rankings under dynamic competition. It introduces Reinforcement Learning from Ranker Feedback (RLRF), which trains RL-aligned agents (RA agents) via Direct Preference Optimization on synthetic preference data generated through Static Generation and Dynamic Generation in ranking games. Key findings show that RA agents consistently outperform non-aligned baselines, generalize to unseen ranking functions, and adapt to strategic opponents, with Dynamic Generation yielding stronger performance than Static Generation. The work demonstrates the viability of RL-based alignment for publisher-driven content optimization in information retrieval, offering scalable, data-efficient training that extends to out-of-distribution rankers and multi-agent settings. It also highlights faithfulness considerations and transferability across ranking functions, underscoring practical implications for robust, competitive search systems.
Abstract
Competitive search is a setting where publishers modify their documents to improve their ranking in response to a query. Recently, publishers have increasingly leveraged LLMs to generate and modify competitive content. We introduce Reinforcement Learning from Ranker Feedback (RLRF), a framework that trains LLMs using preference datasets derived from ranking competitions. The goal of a publisher (LLM-based) agent is to optimize content for improved ranking while accounting for the strategies of competing agents. We generate the datasets using approaches that do not rely on human-authored data. We show that our proposed agents consistently and substantially outperform previously suggested approaches for LLM-based competitive document modification. We further show that our agents are effective with ranking functions they were not trained on (i.e., out of distribution) and that they adapt to strategic opponents. These findings support the significant potential of using reinforcement learning in competitive search.
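To make the idea of a preference dataset derived from ranker feedback concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a hypothetical toy ranker (`toy_ranker_score`, a simple term-overlap score) and a hypothetical helper (`build_preference_pairs`) that turns candidate document versions for a query into preference pairs, with the higher-scored version treated as preferred. The `prompt`/`chosen`/`rejected` field names follow a layout commonly used for preference-optimization data and are an assumption here, as is using the bare query as the prompt.

```python
from itertools import combinations
from typing import Callable, Dict, List


def toy_ranker_score(query: str, document: str) -> float:
    """Hypothetical stand-in ranker: fraction of query terms found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def build_preference_pairs(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Dict[str, str]]:
    """Turn ranker scores over candidate documents into preference pairs.

    For every pair of candidates, the one the ranker scores higher becomes
    "chosen" and the other "rejected". Ties are skipped because they carry
    no preference signal.
    """
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    pairs = []
    for (score_a, doc_a), (score_b, doc_b) in combinations(scored, 2):
        if score_a == score_b:
            continue
        chosen, rejected = (doc_a, doc_b) if score_a > score_b else (doc_b, doc_a)
        pairs.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    # Illustrative inputs only; not data from the paper.
    example_query = "best running shoes"
    example_candidates = [
        "best running shoes for trail runs",
        "running shoes on sale",
        "a guide to hiking boots",
    ]
    for pair in build_preference_pairs(example_query, example_candidates, toy_ranker_score):
        print(pair)
```

In a real setting the scoring function would be the ranking function used in the competition, and the candidates would be document modifications produced by the agent; both are replaced by toy stand-ins in this sketch.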
