OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation

Divij Handa, David Blincoe, Orson Adams, Yinlin Fu

TL;DR

OptAgent tackles the challenge of evaluating and optimizing e-commerce query rewriting in subjective settings where gold-standard judgments are unavailable. It uses a multi-agent, LLM-based shopping simulation as a dynamic fitness signal for a language-model-powered genetic algorithm that iteratively refines queries. On 1000 real Etsy queries across five categories, OptAgent achieves a 21.98% improvement over the original user queries and outperforms a Best-of-N LLM rewriting baseline by 3.36%, with the largest gains on tail and multilingual queries. The approach offers a generalizable, scalable blueprint for aligning AI systems in human-centric tasks by leveraging diverse agent perspectives and evolutionary search rather than a single static judge.

Abstract

Deploying capable and user-aligned LLM-based systems necessitates reliable evaluation. While LLMs excel in verifiable tasks like coding and mathematics, where gold-standard solutions are available, adoption remains challenging for subjective tasks that lack a single correct answer. E-commerce Query Rewriting (QR) is one such problem: determining algorithmically whether a rewritten query properly captures the user intent is extremely difficult. In this work, we introduce OptAgent, a novel framework that combines multi-agent simulations with genetic algorithms to verify and optimize queries for QR. Instead of relying on a static reward model or a single LLM judge, our approach uses multiple LLM-based agents, each acting as a simulated shopping customer, as a dynamic reward signal. The average of these agent-derived scores serves as an effective fitness function for an evolutionary algorithm that iteratively refines the user's initial query. We evaluate OptAgent on a dataset of 1000 real-world e-commerce queries in five different categories, and we observe an average improvement of 21.98% over the original user query and 3.36% over a Best-of-N LLM rewriting baseline.
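
The averaging step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `ShopperAgent` interface, the function name `average_fitness`, and the two toy agents are assumptions made for illustration, and the real agents are LLM-based shoppers rather than simple string heuristics.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: each simulated shopper maps a candidate query to a
# numeric score. In the paper the agents are LLM-based; here they are stubs.
ShopperAgent = Callable[[str], float]


def average_fitness(query: str, agents: Sequence[ShopperAgent]) -> float:
    """Return the mean of the agents' scores for one candidate query."""
    if not agents:
        raise ValueError("at least one agent is required")
    return mean(agent(query) for agent in agents)


# Toy stand-ins for LLM shoppers (assumptions, for illustration only).
def prefers_specific_queries(query: str) -> float:
    return float(len(query.split()))  # more words -> higher score


def prefers_handmade(query: str) -> float:
    return 1.0 if "handmade" in query.lower() else 0.0


agents = [prefers_specific_queries, prefers_handmade]
print(average_fitness("mug", agents))
print(average_fitness("handmade ceramic mug", agents))
```

Averaging over several agents with different preferences is what distinguishes this signal from a single LLM judge: no one agent's bias decides the score on its own.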

Paper Structure

This paper contains 41 sections, 1 equation, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: An overview of the OptAgent framework in query rewriting for e-commerce applications. The user's initial query is passed to the framework, where we first populate the initial generation with candidate rewrites and then perform evolutionary optimization to search for better query rewrites. The fitness of each candidate is determined by its performance in a multi-agent simulation populated by LLM-based shopper agents.
  • Figure 2: Overview of the OptAgent framework applied to e-commerce query rewriting. Blue blocks represent queries, and green blocks represent products listed on the platform. An LLM first rephrases the user's initial query multiple times, and these rewrites form the initial population for the evolutionary framework. The following steps are repeated until the computation budget is exhausted: 1) Each query is evaluated by the multi-agent simulation, where each agent analyzes all products and stores their semantic scores in memory. The purchase agent then loads all the products and decides which ones to purchase, along with the total cost. 2) The semantic scores and the total amount spent constitute the final fitness function for the given query. The top N queries are passed to the next generation and used as parents to populate it. 3) New queries are generated via crossover (mixing two parent queries) and mutation (altering a single parent query).
  • Figure 3: (Left) Data distribution of all the user queries in our dataset. (Right) Distribution of the multi-lingual queries.
  • Figure 4: Average fitness of the best query in the population across four generations for each query subsection. Fitness consistently improves, with diminishing returns in the later generations.
  • Figure 5: Probability distribution of the evaluation agent selecting each listed product for purchase. Like real users, the evaluation agent strongly prefers products listed near the top of the search results over those further down.
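
The loop described in the Figure 2 caption (score every query, keep the top N as parents, refill the population via crossover and mutation) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: `evolve_query` and all `toy_*` helpers are hypothetical names, the LLM-driven rewrite, crossover, and mutation operators are replaced by plain string functions, the computation budget is modeled as a fixed number of generations, and the 50/50 choice between crossover and mutation is arbitrary.

```python
import random
from typing import Callable, List


def evolve_query(
    initial_query: str,
    rewrite: Callable[[str], str],
    crossover: Callable[[str, str], str],
    mutate: Callable[[str], str],
    fitness: Callable[[str], float],
    population_size: int = 6,
    top_n: int = 2,
    generations: int = 4,  # assumption: budget expressed as generation count
    seed: int = 0,
) -> str:
    """Keep the top_n queries each generation and refill with their offspring."""
    if not 2 <= top_n < population_size:
        raise ValueError("need 2 <= top_n < population_size")
    rng = random.Random(seed)
    # Initial population: repeated rewrites of the user's query.
    population: List[str] = [rewrite(initial_query) for _ in range(population_size)]
    for _ in range(generations):
        # Steps 1-2: score every candidate and keep the top_n as parents.
        parents = sorted(population, key=fitness, reverse=True)[:top_n]
        # Step 3: refill the population via crossover or mutation.
        children: List[str] = []
        while len(parents) + len(children) < population_size:
            if rng.random() < 0.5:  # assumption: equal odds for each operator
                first, second = rng.sample(parents, 2)
                children.append(crossover(first, second))
            else:
                children.append(mutate(rng.choice(parents)))
        population = parents + children
    return max(population, key=fitness)


# Toy operators standing in for LLM calls (assumptions, illustration only).
def toy_rewrite(query: str) -> str:
    return f"{query} handmade"


def toy_crossover(first: str, second: str) -> str:
    a, b = first.split(), second.split()
    return " ".join(a[: len(a) // 2 + 1] + b[len(b) // 2:])


def toy_mutate(query: str) -> str:
    return f"{query} gift"


def toy_fitness(query: str) -> float:
    return float(len(set(query.split())))  # count of distinct words


best = evolve_query("mug", toy_rewrite, toy_crossover, toy_mutate, toy_fitness)
print(best, toy_fitness(best))
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next when the fitness function is deterministic. With LLM-based agents the score is noisy, so this guarantee would hold only approximately.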