Large Language Models for Market Research: A Data-augmentation Approach

Mengxin Wang; Dennis J. Zhang; Heng Zhang

TL;DR

The paper tackles the scalability challenge in market research by integrating Large Language Model (LLM)-generated data with real responses in conjoint analysis through a statistically principled data-augmentation framework. Treating AI-generated labels as informative but biased signals, the authors design an AI-Augmented Estimator (AAE) that learns a mapping from AI to human decisions via a first-stage model $g_j(x,z;\theta^*)$ and then optimizes a likelihood-like objective over the real-label space using the auxiliary AI data. They establish consistency and asymptotic normality for $\hat{\bm{\beta}}^{\sf AAE}$ and show variance dominance over naive or AI-only approaches under mild regularity conditions, with the potential for substantial data and cost savings. Empirically, they validate the framework on COVID-19 vaccine preferences and a sports-car dataset, showing that AAE reduces estimation error and saves between 24.9% and 79.8% of the data, depending on the model version and prompting technique. The results suggest that LLM-generated data can be a valuable complement to real data within a robust statistical framework, enabling scalable, cost-effective market research while preserving statistical guarantees.
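The two-stage idea in the summary above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a binary logit instead of the paper's conjoint setting, reads the "likelihood-like objective" as a soft-label likelihood in which auxiliary observations are weighted by the fitted first-stage probabilities, and invents a hypothetical AI labeller that agrees with a human "yes" 90% of the time and turns a human "no" into "yes" 40% of the time. The function names, sample sizes, flip rates, and true coefficients are all illustrative choices.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logit(X, targets, iters=50):
    """Newton's method for a binary logit; targets may be soft labels in [0, 1]."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (targets - p)
        # Negative Hessian of the log-likelihood, with a tiny ridge for stability.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate(n, rng, beta_star):
    """Draw features, human labels y, and biased AI labels z (hypothetical labeller)."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = (rng.random(n) < sigmoid(X @ beta_star)).astype(float)
    # Assumption: P(z=1 | y=1) = 0.9 and P(z=1 | y=0) = 0.4, so z is informative but biased.
    z = (rng.random(n) < np.where(y == 1.0, 0.9, 0.4)).astype(float)
    return X, y, z


def run_demo(seed=0, n=1000, m=5000):
    rng = np.random.default_rng(seed)
    beta_star = np.array([-0.5, 1.0, -1.0])  # illustrative "true" preferences

    Xp, yp, zp = simulate(n, rng, beta_star)   # primary set: human and AI labels
    Xa, _, za = simulate(m, rng, beta_star)    # auxiliary set: AI labels only
    X_all = np.vstack([Xp, Xa])

    # Primary-only estimator: human labels on the small sample.
    primary = fit_logit(Xp, yp)

    # Naive estimator: pool AI labels with human labels as if they were the same.
    naive = fit_logit(X_all, np.concatenate([yp, za]))

    # Stage 1: learn g(x, z) = P(y = 1 | x, z) on the primary set.
    theta = fit_logit(np.column_stack([Xp, zp]), yp)
    # Stage 2: replace each auxiliary AI label by its predicted human-label probability.
    soft = sigmoid(np.column_stack([Xa, za]) @ theta)
    aae = fit_logit(X_all, np.concatenate([yp, soft]))

    def err(b):
        return float(np.linalg.norm(b - beta_star))

    return {
        "primary": primary, "naive": naive, "aae": aae,
        "err_primary": err(primary), "err_naive": err(naive), "err_aae": err(aae),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(name, value)
```

In this toy setup the first-stage logit in $(x, z)$ happens to be correctly specified, which is why the debiased soft labels recover the human choice probabilities on average; with a misspecified first stage the behaviour could differ, and the paper's regularity conditions should be consulted for the actual guarantees.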

Abstract

Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. Our method leverages transfer learning principles to debias the LLM-generated data using a small amount of human data. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.


Paper Structure (35 sections, 9 theorems, 68 equations, 5 figures, 8 tables)

Introduction
Literature
Model and the Data-augmentation Approach
Setup
Data Generation Process and the Best-in-class Estimation.
Primary, Auxiliary and Naïve Estimators
Estimation with AI-augmented Data
Theoretical Properties of Our Estimator
Main Theoretical Results
Value of AI-Augmented Estimation
Proof Sketch of Theorem 4.3
Proof of the First Part of Theorem 4.3.
Proof of the Second Part of Theorem 4.3.
Empirical Analysis I: COVID-19 Vaccination
Empirical Setup
...and 20 more sections

Key Result

Proposition 3.2

When $m, n \rightarrow \infty$, in general, $\hat{\bm{\beta}}^{\sf A}$ and $\hat{\bm{\beta}}^{\sf Naive}$ are not consistent estimators of ${\bm{\beta}}^*$.
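A simplified illustration of why such inconsistency arises, offered as a sketch and not as the paper's proof: assume a binary logit with $\sigma(t) = 1/(1+e^{-t})$, human choices satisfying $P(y=1 \mid x) = \sigma(x^\top \bm{\beta}^*)$, AI labels satisfying $P(z=1 \mid x) = q(x)$, the same feature distribution in both samples, and an estimator that pools $n$ human labels and $m$ AI labels in one logit likelihood with $m/(n+m) \to \rho \in (0,1]$. The symbols $\sigma$, $q$, $\rho$, and $\bar{\bm{\beta}}$ are introduced here for illustration only. Under these assumptions the probability limit $\bar{\bm{\beta}}$ of the pooled estimator solves the population score equation

$$
\mathbb{E}\Big[\, x \,\big( (1-\rho)\,\sigma(x^\top \bm{\beta}^*) + \rho\, q(x) - \sigma(x^\top \bar{\bm{\beta}}) \big) \Big] = 0 .
$$

Substituting $\bar{\bm{\beta}} = \bm{\beta}^*$ leaves $\rho\, \mathbb{E}\big[ x \,( q(x) - \sigma(x^\top \bm{\beta}^*) ) \big]$, which is zero only if the AI labels reproduce the human choice probabilities in this moment sense. Whenever they do not, $\bar{\bm{\beta}} \neq \bm{\beta}^*$; the case $\rho = 1$ corresponds to fitting on AI labels alone.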

Figures (5)

Figure 1: LLM-generated Data $\neq$ Human Data
Figure 2: Illustration of the dataset
Figure 3: Illustration of $\hat{\bm{\beta}}^{\sf AAE}$ and $\hat{\bm{\beta}}^{\sf Naive}$ by Feature
Figure 4: Conjoint estimation accuracy vs. market research costs
Figure 5: Minimum Eigenvalues and Absolute Probability Differences

Theorems & Definitions (10)

Example 3.1: AI-Generated Data Cannot Replace Human Data
Proposition 3.2: Bias of $\hat{\bm{\beta}}^{\sf A}$ and $\hat{\bm{\beta}}^{\sf Naive}$.
Theorem 4.3: Consistency and Asymptotic Normality of AI-augmented Estimator
Proposition 4.4: Dominance of ${\sf Var}^{\sf AAE}$.
Lemma 4.5: Unique Optimizer
Lemma 4.6: Uniform Convergence
Lemma 4.7: Convergence to $\bm{\Omega}$ and $\bm{\Gamma}$
Lemma A.1: Derivative Computation
Lemma A.2: Bounds on the Quadratic Form
Proposition B.1: Projection Error Decomposition under MNL
