
Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents

Haochen Wang, Yi Wu, Daryl Chang, Li Wei, Lukasz Heldt

TL;DR

The paper tackles the challenge of aligning offline proxy objectives with long-term online user satisfaction in industrial-scale recommender systems. It introduces a self-evolving framework in which specialized LLM agents operate in a fast Offline Inner Loop and a slow Online Outer Loop to autonomously generate, validate, and deploy model changes. The approach enables semantic discovery of novel architectures and multi-objective rewards, accelerating experimentation and delivering measurable gains in production on YouTube. By automating hypothesis generation, code production, and experiment orchestration, the framework promises a significant reduction in the idea-to-data cycle and suggests a future where ML Engineers focus on guardrails and strategic vision.

Abstract

Optimizing large-scale machine learning systems, such as recommendation models for global video platforms, requires navigating a massive hyperparameter search space and, more critically, designing sophisticated optimizers, architectures, and reward functions to capture nuanced user behaviors. Achieving substantial improvements in these areas is a non-trivial task that has traditionally relied on extensive manual iteration to test new hypotheses. We propose a self-evolving system that leverages Large Language Models (LLMs), specifically those from Google's Gemini family, to autonomously generate, train, and deploy high-performing, complex model changes within an end-to-end automated workflow. The self-evolving system consists of an Offline Agent (Inner Loop) that performs high-throughput hypothesis generation using proxy metrics, and an Online Agent (Outer Loop) that validates candidates against delayed north-star business metrics in live production. Our agents act as specialized Machine Learning Engineers (MLEs): they exhibit deep reasoning capabilities, discovering novel improvements in optimization algorithms and model architecture, and formulating innovative reward functions that target long-term user engagement. The effectiveness of this approach is demonstrated through several successful production launches at YouTube, confirming that autonomous, LLM-driven evolution can surpass traditional engineering workflows in both development velocity and model performance.
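The division of labor between the two loops can be illustrated with a minimal sketch. It is not the paper's implementation: every name here (`Trial`, `Journal`, `offline_inner_loop`, `online_outer_loop`, and the `propose`, `proxy_eval`, and `live_eval` callables) is a hypothetical stand-in, and the LLM call and live experiment are replaced by deterministic toy functions so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Trial:
    """One record of a tried model change (all field names are hypothetical)."""
    hypothesis: str
    proxy_loss: float
    north_star_delta: Optional[float] = None  # unknown until the outer loop reports


@dataclass
class Journal:
    """Shared history that both loops read from and write to."""
    trials: List[Trial] = field(default_factory=list)


def offline_inner_loop(journal, propose, proxy_eval, n_candidates, keep):
    """Fast loop: propose many changes, score each on a cheap proxy, keep the best."""
    candidates = []
    for _ in range(n_candidates):
        hypothesis = propose(journal)  # stand-in for an LLM call
        candidates.append(Trial(hypothesis, proxy_eval(hypothesis)))
    candidates.sort(key=lambda t: t.proxy_loss)  # lower proxy loss is better
    return candidates[:keep]


def online_outer_loop(journal, survivors, live_eval):
    """Slow loop: measure each survivor on the delayed metric and log the result."""
    for trial in survivors:
        trial.north_star_delta = live_eval(trial.hypothesis)  # stand-in for a live test
        journal.trials.append(trial)
    return journal


# Deterministic toy stand-ins so the sketch runs without any model or traffic.
_ids = iter(range(1000))
demo_journal = Journal()
demo_survivors = offline_inner_loop(
    demo_journal,
    propose=lambda j: f"hypothesis-{next(_ids)}",
    proxy_eval=lambda h: float(int(h.split("-")[1]) % 4),
    n_candidates=6,
    keep=2,
)
online_outer_loop(demo_journal, demo_survivors, live_eval=lambda h: 0.1)
```

The point of the split is cost: the proxy evaluation is cheap enough to run on every candidate, while the delayed online measurement is reserved for the few candidates that survive the filter.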

Paper Structure (34 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The Self-Evolving System Architecture. The framework operates as a dual-loop, self-evolving system centered around a shared context containing a persistent knowledge base and an Experiment Journal of historical trials and their resulting metrics. The Offline Agent (Inner Loop) serves as the high-frequency cognitive core, where LLMs are invoked to instantiate specialized reasoning personas that generate and refine hypotheses into executable code through a closed-loop "Think-Code-Verify" cycle. Offline tool calls are made to evaluate candidates and filter them to a smaller subset. High-potential survivors are promoted to the Online Agent (Outer Loop), which manages the asynchronous transition of every proposal through a five-phase Directed Acyclic Graph (DAG). This Outer Loop ensures model integrity and safety through automated push evaluations and live traffic monitoring before closing the loop by serializing online north star metrics back into the Experiment Journal.
  • Figure 2: Agent Performance (Normalized $\mathcal{L}_{\text{proxy}}$) across Different Model Sizes and Context Engineering Strategies for Improving Optimizer. The plot shows the mean z-score of the loss. Lower scores indicate superior performance.
  • Figure 3: LLM Prompt. Placeholders in {..} are populated based on the agent's specialized persona (Optimizer, Architecture, or Reward), and placeholders in [..] are populated from the shared context.
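
The "mean z-score of the loss" reported in Figure 2 can be illustrated with a small sketch. The excerpt does not say which population the loss is normalized against, so this example assumes the losses from all strategies are pooled to obtain one mean and standard deviation; the function name `mean_z_scores` and the input layout are hypothetical.

```python
import statistics


def mean_z_scores(losses_by_strategy):
    """Return the mean z-score of the loss for each strategy.

    Assumption: z-scores are computed against the mean and population
    standard deviation of all losses pooled across strategies.
    """
    pooled = [x for losses in losses_by_strategy.values() for x in losses]
    mu = statistics.fmean(pooled)
    sigma = statistics.pstdev(pooled)
    if sigma == 0:
        # All losses are identical, so no strategy stands out.
        return {name: 0.0 for name in losses_by_strategy}
    return {
        name: statistics.fmean([(x - mu) / sigma for x in losses])
        for name, losses in losses_by_strategy.items()
    }


# Toy input: two strategies with two runs each (values are made up).
example = mean_z_scores({"a": [1.0, 1.0], "b": [3.0, 3.0]})
```

Under this convention a negative score means a strategy's loss sits below the pooled average, which matches the caption's reading that lower scores are better.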