Large Language Models for Market Research: A Data-augmentation Approach
Mengxin Wang, Dennis J. Zhang, Heng Zhang
TL;DR
The paper tackles the scalability challenge in market research by integrating Large Language Model (LLM)-generated data with real responses in conjoint analysis through a statistically principled data-augmentation framework. Treating AI-generated labels as informative but biased signals, the authors design an AI-Augmented Estimator (AAE) that first learns a mapping from AI to human decisions via a first-stage model $g_j(x,z;\theta^*)$ and then optimizes a likelihood-like objective over the real-label space using the auxiliary AI data. They establish consistency and asymptotic normality for $\hat{\boldsymbol{\beta}}^{AAE}$ and show variance dominance over naive or AI-only approaches under mild regularity conditions, with the potential for substantial data and cost savings. Empirically, they validate the framework on COVID-19 vaccine preferences and a sports-car dataset, demonstrating that AAE reduces estimation error and yields data savings of 24.9% to 79.8%, depending on the model version and prompting technique. The results suggest LLM-generated data can be a valuable complement to real data within a robust statistical framework, enabling scalable, cost-effective market research while preserving statistical guarantees.
Abstract
Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. Our method leverages transfer learning principles to debias the LLM-generated data using a small amount of human data. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.
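To make the two-stage idea concrete, the following is a minimal sketch on simulated data, not the authors' implementation. It assumes a binary logit choice model instead of the paper's general setting, a logistic first stage standing in for $g_j(x,z;\theta)$, and a soft-label log-likelihood as one plausible form of the "likelihood-like objective"; the summary above does not spell out the exact objective. All names (`fit_logit`, `simulate`, `beta_ai`) and the simulated bias mechanism are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit(X, targets, n_iter=50, ridge=1e-8):
    """Newton's method for a logit model; targets may be soft labels in [0, 1]."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (targets - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

rng = np.random.default_rng(0)
beta_true = np.array([0.5, 1.0, -1.0])  # human preference parameters (simulated)
beta_ai = np.array([-0.5, 2.0, 0.0])    # hypothetical biased "LLM" preferences

def simulate(n):
    """Return features X, human choices y, and AI-generated labels z."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = rng.binomial(1, sigmoid(X @ beta_true))
    z_own = rng.binomial(1, sigmoid(X @ beta_ai))
    copies_human = rng.random(n) < 0.6   # z agrees with y part of the time
    z = np.where(copies_human, y, z_own)  # informative but biased signal
    return X, y, z

X_p, y_p, z_p = simulate(1000)    # primary set: human and AI labels
X_a, _, z_a = simulate(10000)     # auxiliary set: AI labels only

# Stage 1: learn the AI-to-human mapping g(x, z) ~ P(y = 1 | x, z) on the primary set.
theta_hat = fit_logit(np.column_stack([X_p, z_p]), y_p)
g_aux = sigmoid(np.column_stack([X_a, z_a]) @ theta_hat)

# Stage 2: fit the choice model on human labels plus debiased soft labels.
beta_aae = fit_logit(np.vstack([X_p, X_a]), np.concatenate([y_p, g_aux]))

# Baselines: human data only, and naive substitution of raw AI labels.
beta_primary = fit_logit(X_p, y_p)
beta_naive = fit_logit(np.vstack([X_p, X_a]), np.concatenate([y_p, z_a]))

for name, est in [("primary only", beta_primary),
                  ("naive", beta_naive),
                  ("augmented", beta_aae)]:
    print(name, np.linalg.norm(est - beta_true))
```

In this sketch the naive baseline treats the raw AI labels `z_a` as if they were human choices, so their bias carries into the estimate, whereas the augmented fit only uses them through the first-stage mapping learned from the small human-labeled set. How much the augmented fit improves on the primary-only baseline depends on how informative the AI labels are and on the first-stage specification.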
