LOLA: LLM-Assisted Online Learning Algorithm for Content Experiments

Zikun Ye, Hema Yoganarasimhan, Yufeng Zheng

TL;DR

LOLA is a novel framework that integrates Large Language Models (LLMs) with adaptive experimentation to optimize content delivery. It outperforms the standard A/B test method, pure bandit algorithms, and pure-LLM approaches, particularly when experimental traffic is limited.

Abstract

Modern media firms require automated and efficient methods to identify content that is most engaging and appealing to users. Leveraging a large-scale dataset from Upworthy (a news publisher), which includes 17,681 headline A/B tests, we first investigate the ability of three pure-LLM approaches to identify the catchiest headline: prompt-based methods, embedding-based methods, and fine-tuned open-source LLMs. Prompt-based approaches perform poorly, while both OpenAI-embedding-based models and the fine-tuned Llama-3-8B achieve marginally higher accuracy than random predictions. In sum, none of the pure-LLM-based methods can predict the best-performing headline with high accuracy. We then introduce the LLM-Assisted Online Learning Algorithm (LOLA), a novel framework that integrates Large Language Models (LLMs) with adaptive experimentation to optimize content delivery. LOLA combines the best pure-LLM approach with the Upper Confidence Bound algorithm to allocate traffic and maximize clicks adaptively. Our numerical experiments on Upworthy data show that LOLA outperforms the standard A/B test method (the current status quo at Upworthy), pure bandit algorithms, and pure-LLM approaches, particularly in scenarios with limited experimental traffic. Our approach is scalable and applicable to content experiments across various settings where firms seek to optimize user engagement, including digital advertising and social media recommendations.
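
To make the idea of pairing LLM predictions with the Upper Confidence Bound algorithm concrete, here is a minimal sketch in Python. The abstract does not spell out how LOLA combines the two, so everything below is an assumption for illustration: the function name `ucb_with_llm_prior`, the `prior_weight` parameter, and the choice to treat each LLM-predicted click-through rate (CTR) as a batch of pseudo-impressions that warm-starts a standard UCB1 rule. The `true_ctr` values exist only to simulate user clicks.

```python
import math
import random


def ucb_with_llm_prior(llm_ctr, true_ctr, horizon, prior_weight=50, seed=0):
    """Allocate `horizon` impressions across headlines with a UCB1 rule
    warm-started by LLM-predicted CTRs (an illustrative assumption, not
    necessarily the paper's exact algorithm).

    llm_ctr      -- hypothetical LLM CTR predictions, one per headline
    true_ctr     -- simulated ground-truth CTRs (unknown in practice)
    prior_weight -- assumed number of pseudo-impressions per prediction
    """
    if prior_weight < 1:
        raise ValueError("prior_weight must be at least 1")
    rng = random.Random(seed)
    k = len(llm_ctr)
    # Each LLM prediction counts as `prior_weight` pseudo-impressions.
    pulls = [float(prior_weight)] * k
    clicks = [p * prior_weight for p in llm_ctr]
    real_pulls = [0] * k
    total_clicks = 0
    for _ in range(horizon):
        total = sum(pulls)
        # UCB score: empirical mean (prior included) plus exploration bonus.
        scores = [
            clicks[i] / pulls[i] + math.sqrt(2 * math.log(total) / pulls[i])
            for i in range(k)
        ]
        arm = max(range(k), key=scores.__getitem__)
        # Simulate one impression of the chosen headline.
        reward = 1 if rng.random() < true_ctr[arm] else 0
        pulls[arm] += 1
        clicks[arm] += reward
        real_pulls[arm] += 1
        total_clicks += reward
    return real_pulls, total_clicks


# Three headlines: made-up LLM predictions and simulated true CTRs.
allocation, total = ucb_with_llm_prior(
    llm_ctr=[0.02, 0.05, 0.03],
    true_ctr=[0.01, 0.06, 0.02],
    horizon=2000,
)
print(allocation, total)
```

Under this assumed design, a larger `prior_weight` makes the allocation trust the LLM predictions for longer, while a smaller value lets observed clicks dominate sooner. That trade-off is one plausible reading of why a combined approach would help most when experimental traffic is limited.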

Paper Structure (38 sections, 8 equations, 14 figures, 11 tables, 5 algorithms)

Figures (14)

  • Figure 1: Zero-Shot Prompting for Headline Selection.
  • Figure 2: In-Context Learning Prompt for Headline Selection.
  • Figure 3: The pipeline of the headline selection using LLM text embeddings. We use an A/B test with three headlines for illustration. Headlines 1, 2, and 3 are natural language sentences, while Embedding 1, 2, and 3 are numerical vectors.
  • Figure 4: Loss curve and accuracy curve as the number of training epochs increases.
  • Figure 5: Average clicks per experiment per period under different time horizon multipliers. The Y-axis shows the average clicks per test per period. For instance, if a test has two headlines that receive 1 and 2 clicks, respectively, under $\tau=100$, then the average clicks per period for this test are $(1+2)/100=0.03$. The Y value is the average of this per-test number over all tests. This measure scales well with the platform's total clicks in tests because every headline carries the same weight, regardless of how many headlines its test contains.
  • ...and 9 more figures
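
The per-period click measure described in the Figure 5 caption can be written out as a short Python sketch. The function name `avg_clicks_per_period` and the input layout (one list of per-headline click counts per test) are assumptions made for illustration; the divisor $\tau$ is used exactly as in the caption's worked example.

```python
import math


def avg_clicks_per_period(tests, tau):
    """Average clicks per test per period.

    tests -- list of tests, each a list of per-headline click counts
    tau   -- divisor used in the caption's example (tau = 100 there)
    """
    # Per-test value: total clicks in the test divided by tau.
    per_test = [sum(headline_clicks) / tau for headline_clicks in tests]
    # Y value: average of the per-test values over all tests.
    return sum(per_test) / len(per_test)


# The caption's example: one test, two headlines with 1 and 2 clicks.
example = avg_clicks_per_period([[1, 2]], tau=100)
assert math.isclose(example, 0.03)
```

With a second, hypothetical test of three headlines receiving 0, 1, and 4 clicks, the per-test values would be 0.03 and 0.05, and the plotted Y value would be their average.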