SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents for Rapid Design Evaluation

Tim Rieder; Marian Schneider; Mario Truss; Vitaly Tsaplin; Alina Rublea; Sinem Dere; Francisco Chicharro Sanz; Tobias Reiss; Mustafa Doga Dogan

TL;DR

SimAB reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Its design emphasizes speed, early feedback, actionable rationales, and audience specification.

Abstract

A/B testing is a standard method for validating design decisions, yet its reliance on real user traffic limits iteration speed and makes certain experiments impractical. We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales. Through a formative study with experimentation practitioners, we identified scenarios where traffic constraints hinder testing, including low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive contexts. Our design emphasizes speed, early feedback, actionable rationales, and audience specification. We evaluate SimAB against 47 historical A/B tests with known outcomes, achieving 67% overall accuracy, increasing to 83% for high-confidence cases. Additional experiments show robustness to naming and positional bias and demonstrate accuracy gains from personas. Practitioner feedback suggests that SimAB supports faster evaluation cycles and rapid screening of designs difficult to assess with traditional A/B tests.
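The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes each agent returns one of the labels "Control", "Challenger", or "None", and that significance is checked with an exact two-sided binomial (sign) test over the decided votes. The abstract does not specify which test SimAB uses, so the test, the function names, and the `alpha` threshold are all hypothetical.

```python
from collections import Counter
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes in n fair coin flips."""
    if n == 0:
        return 1.0
    # Sum the probabilities of the smaller tail, then double it (capped at 1).
    tail = min(k, n - k)
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


def aggregate_votes(votes: list[str], alpha: float = 0.05) -> dict:
    """Count agent preferences and declare a winner if the split is significant.

    Assumption: "None" votes are reported but excluded from the test.
    """
    counts = Counter(votes)
    control, challenger = counts["Control"], counts["Challenger"]
    p_value = binomial_two_sided_p(challenger, control + challenger)
    if p_value < alpha and control != challenger:
        winner = "Challenger" if challenger > control else "Control"
    else:
        winner = "None"
    return {"counts": dict(counts), "p_value": p_value, "winner": winner}


# Hypothetical run with 31 agents.
votes = ["Challenger"] * 24 + ["Control"] * 5 + ["None"] * 2
result = aggregate_votes(votes)
print(result["winner"], round(result["p_value"], 4))
```

Excluding undecided agents from the test is one possible design choice; treating "None" as a third outcome would call for a different test.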

Paper Structure (46 sections, 10 figures, 3 tables)

Introduction
Related Work
A/B Testing and Optimization Platforms
LLMs for Design Feedback and Evaluation
User Simulating Agents
Formative Study
Informative Interviews with Practitioners
A/B Testing Pain Points
Design Implications & Principles
System Design
Input Processing
Retrieval-Augmented Generation
Persona Generation
Batched Generation & Diversity Constraints
Persona Simulation
...and 31 more sections

Figures (10)

Figure 1: Example Input. Control is on the left; Challenger is on the right. Together with a primary conversion goal ('Will users donate to the Wikimedia Foundation?'), these constitute the mandatory input to start an A/B test using SimAB. The test was created for this work by adapting a 2010 Wikipedia A/B test to the current Wikipedia page for https://en.wikipedia.org/wiki/A/B_testing. The original test compared two group photo positions to determine which was more effective at driving Wikimedia Foundation donations.
Figure 2: Comparison of the target distribution, as specified in a user's audience restrictions, with the distribution generated by the SimAB persona generation module. Blue: baseline without segment information in the LLM call; orange: segments supplied as an audience restriction to align the generated personas. Left: absolute mean squared error (MSE); right: root mean squared error (RMSE). Term Frequency-Inverse Document Frequency (TF-IDF) evaluates a word's importance in a document relative to the whole corpus; cosine similarity is used to measure the similarity of the words in two collections. Both TF-IDF and direct requests to LLMs were used to classify each generated persona into the most similar persona segment (the alignment objective provided in the audience restrictions). Note that outliers (RMSE) in particular are reduced significantly.
Figure 3: Example Personas. The generated personas include a context, ranging from age, education, and background, to the present context (the main goals in the current browser session).
Figure 4: Example Winner. After just 31 agents (1-2 minutes, all started in parallel) conclude, statistical significance is reached. The result coincides with the ground truth data gathered from real users.
Figure 5: Example Insights. The rationales of all agents are aggregated to show the top reasons for each preference (Control, Challenger, or None), drawn from the agents that chose it. Additionally, SimAB proposes immediate action items to improve and iterate on the Challenger.
...and 5 more figures
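
The evaluation described in the Figure 2 caption can be sketched as follows: each generated persona is assigned to the most similar segment via TF-IDF and cosine similarity, and the resulting segment distribution is compared to the target with MSE and RMSE. This is a standard-library sketch under stated assumptions, not the authors' implementation. The whitespace tokenizer, the smoothed IDF formula, the function names, and the example texts are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors using whitespace tokens (an assumption)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive for terms found in every document.
        vectors.append({t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_segments(personas: list[str], segments: list[str]) -> list[int]:
    """Return, for each persona, the index of its most similar segment."""
    vecs = tfidf_vectors(segments + personas)
    seg_vecs, persona_vecs = vecs[:len(segments)], vecs[len(segments):]
    return [max(range(len(segments)), key=lambda i: cosine(p, seg_vecs[i]))
            for p in persona_vecs]


def distribution_error(target: list[float], assigned: list[int]) -> tuple[float, float]:
    """MSE and RMSE between target segment shares and observed shares."""
    observed = [assigned.count(i) / len(assigned) for i in range(len(target))]
    mse = sum((t - o) ** 2 for t, o in zip(target, observed)) / len(target)
    return mse, math.sqrt(mse)


# Hypothetical segments and generated personas.
segments = ["budget student looking for discounts",
            "enterprise manager buying software licenses"]
personas = ["a student hunting discounts on a tight budget",
            "an IT manager comparing enterprise software licenses",
            "student searching for cheap discounts"]
assigned = assign_segments(personas, segments)
print(assigned, distribution_error([0.5, 0.5], assigned))
```

The caption also mentions classifying personas by direct requests to LLMs; that variant is not shown here.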
