Interview-Informed Generative Agents for Product Discovery: A Validation Study

Zichao Wang, Alexa Siu

Abstract

Large language models (LLMs) have shown strong performance on standardized social science instruments, but their value for product discovery remains unclear. We investigate whether interview-informed generative agents can simulate user responses in concept testing scenarios. Using in-depth workflow interviews with knowledge workers, we created personalized agents and compared their evaluations of novel AI concepts against the same participants' responses. Our results show that agents are distribution-calibrated but identity-imprecise: they fail to replicate the specific individual they are grounded in, yet approximate population-level response distributions. These findings highlight both the potential and the limits of LLM simulation in design research. While unsuitable as a substitute for individual-level insights, simulation may provide value for early-stage concept screening and iteration, where distributional accuracy suffices. We discuss implications for integrating simulation responsibly into product development workflows.
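
To make "distribution-calibrated but identity-imprecise" concrete, the sketch below contrasts the two levels of comparison used in the study: per-individual error (mean absolute error, as in Figure 3) versus population-level distributional distance (Wasserstein distance, as in Figure 4). This is an illustrative toy example with synthetic ratings, not the study's code or data.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1-5 Likert ratings of one concept from 50 human
# participants and from the agents grounded in those same participants.
human_ratings = rng.integers(1, 6, size=50)
agent_ratings = rng.integers(1, 6, size=50)

# Identity level: compare each agent to the participant it simulates.
abs_errors = np.abs(agent_ratings - human_ratings)
mae = abs_errors.mean()

# 95% bootstrap confidence interval over participants (cf. Figure 3).
boot = [rng.choice(abs_errors, size=abs_errors.size, replace=True).mean()
        for _ in range(2000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Population level: compare the two rating distributions as wholes (cf. Figure 4).
w = wasserstein_distance(human_ratings, agent_ratings)

print(f"per-individual MAE: {mae:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
print(f"Wasserstein distance between distributions: {w:.2f}")
```

A low per-individual MAE would indicate identity precision; a low Wasserstein distance alongside a high MAE is exactly the "distribution-calibrated but identity-imprecise" pattern described above.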


Paper Structure

This paper contains 47 sections, 14 figures, and 4 tables.

Figures (14)

  • Figure 1: Prototype concepts for AI document workflow assistance. The four designs span different levels of intervention and integration: (a) Multi-document Q&A Assistant, (b) Smart Highlights Assistant, (c) Audio Assistant, and (d) Workflow Actions Assistant. These concepts represent a spectrum from passive support to active automation, enabling evaluation across varied user needs. Higher-resolution figures are available in Supplementary Material D.
  • Figure 2: Simulation architecture for interview-informed generative agents. The agent creation module builds memory and scratchpad representations from interview transcripts, while the agent interaction module retrieves relevant turns, reflects, and generates answers to concept test questions. Together, these components define how each agent simulates a target participant (a schematic code sketch of this pipeline follows this figure list).
  • Figure 3: Individual-level accuracy of simulated responses to categorical questions. Compared to human participants, all agent variants show higher error (MAE) and lower agreement/correlation, with no significant differences across designs. Error bars indicate 95% bootstrap confidence intervals over participants (more details in Section \ref{sec:metrics}).
  • Figure 4: Overall population-level alignment of simulated responses. The two bar charts of categorical responses (Likert-scale and NPS-scale, respectively) show that interview-informed agents capture more realistic variability than scratchpad-only and no-information baselines. The summary-statistics table at the bottom, based on Wasserstein distance, confirms that interview-informed agents achieve the closest distributional match to both human response attempts among the agent designs.
  • Figure 5: Evaluation of open-ended response quality. Across sentiment, explanation, topic coverage, and tone, interview-informed agents outperform other agent variants but remain consistently below human participants. The largest gap arises in tone, reflecting agents’ inability to capture conversational style from transcripts.
  • ...and 9 more figures
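
As noted in the Figure 2 caption, the agent pipeline has two modules: creation (build a memory of transcript turns plus a scratchpad persona summary) and interaction (retrieve relevant turns, reflect, then generate an answer). The following is a schematic reconstruction under those stated assumptions; `llm` and `embed` are hypothetical placeholders, and none of this is the authors' released code.

```python
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    # Placeholder: a real system would call a large language model here.
    return f"[LLM response to a {len(prompt)}-character prompt]"

def embed(text: str) -> list[float]:
    # Placeholder bag-of-characters embedding; a real system would use a
    # sentence-embedding model instead.
    vec = [0.0] * 64
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

@dataclass
class InterviewAgent:
    transcript_turns: list[str]                # raw interview turns (the "memory")
    scratchpad: str = ""                       # distilled persona summary
    index: list = field(default_factory=list)  # (embedding, turn) pairs

    def build(self) -> None:
        # Agent creation: distill the transcript into a persona scratchpad
        # and index every turn for later retrieval.
        self.scratchpad = llm(
            "Summarize this participant's workflow, needs, and attitudes:\n"
            + "\n".join(self.transcript_turns)
        )
        self.index = [(embed(t), t) for t in self.transcript_turns]

    def answer(self, question: str, k: int = 5) -> str:
        # Agent interaction: retrieve the k most relevant turns, reflect on
        # them, then answer the concept-test question in the persona's voice.
        q = embed(question)
        top = sorted(self.index, key=lambda e: cosine(e[0], q), reverse=True)[:k]
        excerpts = "\n".join(turn for _, turn in top)
        reflection = llm(
            "Given these interview excerpts, what would this participant "
            "care about here?\n" + excerpts
        )
        return llm(
            f"Persona:\n{self.scratchpad}\n\nReflection:\n{reflection}\n\n"
            f"Answer this concept-test question as the participant:\n{question}"
        )

# Usage: ground an agent in a (toy) transcript, then pose a concept question.
agent = InterviewAgent(["I mostly read long PDF reports.",
                        "I waste time re-finding key quotes."])
agent.build()
print(agent.answer("How useful would a Smart Highlights Assistant be for you?"))
```

One plausible reading of this two-module split: retrieval grounds each answer in concrete transcript evidence, while the scratchpad keeps the simulated persona stable across questions.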