Benchmarking In-context Experiential Learning Through Repeated Product Recommendations

Gilbert Yang, Yaqin Chen, Thomson Yen, Hongseok Namkoong

TL;DR

The paper argues that real-world intelligence requires in-context experiential learning across multiple episodes. It introduces BELA, a scalable benchmark that combines real Amazon products, diverse user personas, and an LLM-powered customer simulator to study multi-turn, multi-episode recommendations where rewards are expressed in natural language. Across experiments, state-of-the-art models fail to meaningfully improve from experience and show poor uncertainty calibration, highlighting a gap in current agentic capabilities. BELA offers a realistic platform for studying user modeling, preference elicitation, and exploration in dynamic, uncertain environments, with potential to drive future advances in experiential learning and planning.

Abstract

To reliably navigate ever-shifting real-world environments, agents must grapple with incomplete knowledge and adapt their behavior through experience. However, current evaluations largely focus on tasks that leave no ambiguity and do not measure agents' ability to adaptively learn and reason through the experiences they have accrued. We exemplify the need for this in-context experiential learning in a product recommendation context, where agents must navigate shifting customer preferences and product landscapes through natural language dialogue. We curate a benchmark for experiential learning and active exploration (BELA) that combines (1) rich real-world products from Amazon, (2) a diverse collection of user personas to represent heterogeneous yet latent preferences, and (3) an LLM user simulator powered by the persona to create rich interactive trajectories. We observe that current frontier models struggle to meaningfully improve across episodes, underscoring the need for agentic systems with strong in-context learning capabilities.

Paper Structure

This paper contains 32 sections, 6 equations, 24 figures, 1 table.

Figures (24)

  • Figure 1: Top Left. Typical agentic benchmarks (e.g. yao2024taubench) focus on settings where all information is provided initially, and the model is tasked with producing the correct answer in a zero-shot fashion. Bottom Left. Recent benchmarks (e.g. li2024mediq) for LLM agents increasingly focus on multi-turn settings. Right. Our In-Context Experiential Learning setting.
  • Figure 2: Benchmark for Experiential Learning and Active Exploration (BELA). Exemplar recommendation dialogues for in-context experiential learning across 2 customer personas and 3 choice sets.
  • Figure 3: An exemplar persona taken from li2025llmgeneratedpersonapromise and their preferences over a category of products from hou2024_amazonproducts. Scoring is done by GPT-4o and Gemini-1.5-Pro. Consistent with the persona’s curly hair, curl-enhancing products are rated highly, whereas straightening products receive low scores.
  • Figure 4: We use a predefined tree of categories (spotlight_categories) and filter out the ones unsuitable as choice sets. We then score the products within each choice set with a persona-simulating LLM.
  • Figure 5: Left. Models perform better than random, but significantly worse than the oracle, with no learning across episodes. Right. Gemini models tend to ask fewer questions in later episodes.
  • ...and 19 more figures