GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

TL;DR

This work introduces GrowthHacker, a benchmark that uses LLM-based agents to automatically modify OPE code and iteratively improve offline evaluation accuracy. It replaces brittle multi-agent systems with a two-agent architecture (Analyzer and Coder) and demonstrates 100% reliability and substantial improvements, notably a 106.7% average gain among positive outcomes, across OBP and Scope-RL datasets. The study provides a comprehensive experimental design with 504 runs, analyzes failure modes, and shows library- and environment-specific patterns, highlighting both the potential and current limitations of automated OPE optimization. The results suggest that automated code refinement can serve as a scalable, data-driven mechanism to enhance production decision-making without online experiments, with future work pointing to memory, exploration-exploitation balance, and human-in-the-loop safeguards.
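
As a rough illustration of how a figure such as "average gain among positive outcomes" can be read, the sketch below aggregates per-run improvement percentages. The definitions used here (a run counts as a success when its improvement is strictly positive, and the average is taken only over those runs) are assumptions for illustration, and the input values are made-up toy numbers, not results from the paper.

```python
def summarize(improvements_pct):
    """Aggregate per-run improvements (in percent).

    Assumed definitions, not taken from the paper:
    - success rate: share of runs with a strictly positive improvement
    - average positive gain: mean improvement over those runs only
    """
    positives = [x for x in improvements_pct if x > 0]
    success_rate = len(positives) / len(improvements_pct)
    avg_positive_gain = sum(positives) / len(positives) if positives else 0.0
    return success_rate, avg_positive_gain


# Toy values chosen for easy arithmetic; they are hypothetical.
toy_runs = [50.0, -10.0, 0.0, 150.0]
success_rate, avg_positive_gain = summarize(toy_runs)
print(f"success rate: {success_rate:.0%}, average positive gain: {avg_positive_gain:.1f}%")
```

Under these assumed definitions, a high average gain among positive outcomes can coexist with a success rate below 50%, because regressions and no-change runs are excluded from the average.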

Abstract

With the software industry shifting toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, off-policy evaluation (OPE), or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose GrowthHacker, a benchmark with agent and baseline methods on large-scale real-world datasets, which iteratively optimizes code, evaluates results, and begins new optimization cycles. We collected datasets, established protocols, implemented baselines for OPE on the Open Bandit Pipeline (OBP) \cite{saito2021openbanditdatasetpipeline} and Scope-RL \cite{kiyohara2023scope}, and developed the two_agent framework, which reduces system complexity while preserving optimization effectiveness. Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7% among positive outcomes. Both two_agent and CrewAI reach 45% success rates, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM-based agents as automated "growth hackers" to enhance OPE systems, with implications for scaling data-driven decision-making in production.
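
The optimize, evaluate, repeat cycle described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the `analyzer` and `coder` functions are hypothetical stand-ins for the LLM agents, the "code modification" is reduced to changing one clipping parameter of a standard inverse propensity scoring (IPS) estimator, and the logged data is synthetic so that the true policy value is known.

```python
import random

# Toy bandit setup; all numbers are assumptions made for this sketch.
TRUE_REWARD = {0: 0.2, 1: 0.5, 2: 0.8}   # true expected reward per action
BEHAVIOR = {0: 0.6, 1: 0.3, 2: 0.1}      # logging (behavior) policy
TARGET = {0: 0.1, 1: 0.3, 2: 0.6}        # policy being evaluated offline


def make_logs(n, seed=0):
    """Simulate logged (action, reward, propensity) triples from BEHAVIOR."""
    rng = random.Random(seed)
    actions = list(BEHAVIOR)
    weights = [BEHAVIOR[a] for a in actions]
    logs = []
    for _ in range(n):
        action = rng.choices(actions, weights=weights)[0]
        reward = 1.0 if rng.random() < TRUE_REWARD[action] else 0.0
        logs.append((action, reward, BEHAVIOR[action]))
    return logs


def ips_estimate(logs, clip=None):
    """IPS estimate of the target policy value, with optional weight clipping."""
    total = 0.0
    for action, reward, propensity in logs:
        weight = TARGET[action] / propensity
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / len(logs)


def relative_error(estimate, truth):
    return abs(estimate - truth) / abs(truth)


def analyzer(iteration):
    """Hypothetical stand-in for the Analyzer agent: propose a change."""
    proposals = [10.0, 5.0, 2.0]
    return proposals[iteration % len(proposals)]


def coder(config, proposal):
    """Hypothetical stand-in for the Coder agent: apply the proposed change."""
    new_config = dict(config)
    new_config["clip"] = proposal
    return new_config


def optimize(logs, truth, iterations=3):
    """Run optimization cycles, keeping a change only if the error drops."""
    best_config = {"clip": None}
    baseline_error = relative_error(ips_estimate(logs, None), truth)
    best_error = baseline_error
    for i in range(iterations):
        candidate = coder(best_config, analyzer(i))
        error = relative_error(ips_estimate(logs, candidate["clip"]), truth)
        if error < best_error:
            best_config, best_error = candidate, error
    return best_config, baseline_error, best_error


logs = make_logs(2000)
truth = sum(TARGET[a] * TRUE_REWARD[a] for a in TARGET)
best_config, baseline_error, best_error = optimize(logs, truth)
print(best_config, round(baseline_error, 4), round(best_error, 4))
```

Because a candidate is accepted only when it lowers the error, the loop never ends worse than the baseline. In this toy setting the ground truth is available by construction, which is an assumption of the sketch rather than a description of the paper's protocol.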

Paper Structure

This paper contains 45 sections, 8 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Flow Diagram for GrowthHacker, a benchmark system for optimizing Off-Policy Evaluation through code modification, using an LLM or LLM agent.
  • Figure 2: Visual Illustration for Two-Agent Architecture
  • Figure 3: Visual Illustration for Performance of Estimator on RTB Basics (discrete).
  • Figure 4: Stacked bar chart showing normalized outcome proportions per framework. Each bar sums to 100%.