Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Serin Kim, Sangam Lee, Dongha Lee

TL;DR

Persona2Web is the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions.

Abstract

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our code and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8.

Paper Structure (56 sections, 4 equations, 5 figures, 16 tables)


Figures (5)

  • Figure 1: A personalized web agent generates user-specific responses by leveraging user history, whereas a general web agent often produces generic or random outputs that fail to align with user preferences.
  • Figure 2: Overview of Persona2Web construction pipeline and reasoning-aware evaluation process.
  • Figure 3: Performance across query ambiguity levels for each agent architecture and history access scheme. Bars show Success Rate (left axis) for levels 0, 1, and 2 (gray, medium blue, and dark blue, respectively). Lines show preference (red) and website (green) scores (right axis). The backbone models are o3, GPT-4.1, and Qwen3-80B-Instruct.
  • Figure 4: Detailed statistics of the user profiles provided by the Persona2Web benchmark.
  • Figure 5: Error statistics for AgentOccam across backbone models (top) and a breakdown of personalization-related errors for Gemini 2.5 Flash (bottom).