ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, Craig Boutilier

TL;DR

This paper introduces ConvApparel, a new dataset of human-AI conversations designed to address a critical realism gap in LLM-based user simulators, together with a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization.

Abstract

The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

Paper Structure (48 sections, 11 figures, 6 tables)

Figures (11)

  • Figure 1: A conversation transcript from ConvApparel between a user and the "good" conversational recommender. (The carousels with recommended items are not shown for brevity; see \ref{fig:study_screen}.)
  • Figure 2: Validation of LLM-as-a-judge. LLM judgments (b) capture the qualitative difference between good and bad agents found in human ratings (a), but tend to exaggerate the magnitude of the gap. Human third-person judgments are shown in (c).
  • Figure 3: Discriminator performance and HLS as a function of the number of training examples.
  • Figure 4: Population-Level Statistical Alignment (PLSA) with the good (top) and bad (bottom) recommenders. Data-driven simulators (ICL, SFT) consistently align more closely with human behavioral distributions than the prompted baseline across general statistics, dialog acts, and inferred user experience.
  • Figure 5: Counterfactual validation results. Simulators are trained on one agent type and tested on an unseen one (top: good$\rightarrow$bad, bottom: bad$\rightarrow$good). The ICL and SFT models show stronger generalization than the prompted baseline, adapting their behavior to the new agent and more closely matching human patterns.
  • ...and 6 more figures