LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, Thomas V. Wiecki

TL;DR

This paper tackles the high cost and biases of traditional consumer panels by introducing semantic similarity rating (SSR): an LLM is prompted to write a free-text purchase-intent statement, which is then mapped to a 5-point Likert scale via embedding similarity to anchor statements. Across 57 personal-care product surveys (9,300 human responses), SSR reaches about 90% of human test-retest reliability (the ceiling on achievable correlation with the human data) and produces realistic response distributions (KS similarity > 0.85), while also yielding rich qualitative rationales. SSR requires no training data or fine-tuning, making it a scalable, interpretable plug-in for concept testing that preserves traditional survey metrics while adding qualitative insights and a broader distribution of responses. The approach shows substantial potential to augment or accelerate early-stage product research, with caveats around anchor design, demographic coverage, and the domain knowledge encoded in the LLMs.

Abstract

Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.
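
The abstract reports distribution agreement as a "KS similarity" but does not define it here. A minimal sketch, assuming it means one minus the Kolmogorov-Smirnov distance (the largest gap between the two cumulative distributions over the five Likert points); the function name `ks_similarity` and the toy distributions are hypothetical and not taken from the paper:

```python
from itertools import accumulate


def ks_similarity(p, q):
    """Return 1 - KS distance between two discrete Likert distributions.

    Assumption: "KS similarity" is 1 minus the maximum absolute difference
    between the cumulative distributions; the paper may define it differently.
    p and q are probability lists over the ratings 1..5 (each sums to 1).
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    distance = max(abs(a - b) for a, b in zip(cdf_p, cdf_q))
    return 1.0 - distance


# Toy distributions (made up for illustration only).
human = [0.05, 0.10, 0.25, 0.35, 0.25]
synthetic = [0.02, 0.08, 0.30, 0.40, 0.20]
similarity = ks_similarity(human, synthetic)
print(round(similarity, 3))
```

Under this reading, identical distributions score 1 and distributions concentrated at opposite ends of the scale score 0, so the reported threshold of 0.85 would correspond to a cumulative gap of at most 0.15.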

Paper Structure

This paper contains 29 sections, 9 equations, 35 figures, 1 table.

Figures (35)

  • Figure 1: Different response generation procedures and SSR response-likelihood mapping. (A) A synthetic consumer is constructed by instructing an LLM to impersonate a consumer with certain demographic properties and showing it a product concept as an image containing a description and possibly concept art (see App. \ref{sec:product-concept-examples}). The synthetic consumer is then asked about its purchase intent (PI). (1) In the direct Likert-rating approach, the LLM's response is restricted to one of 1, 2, 3, 4, or 5. (2) Alternatively, we let the LLM write a brief textual response about its PI. Subsequently, we prompt the same model to be a Likert-rating "expert" and map the textual response to an integer between 1 and 5. (3) Because textual responses can result in varied ratings on the 5-point Likert scale, we introduce the semantic similarity rating method. We retrieve the embedding vector for the textual response from a corresponding model, compare it to five reference response embedding vectors, and construct a response distribution on the Likert scale. (B) In an embedding space, the synthetic response will have a certain angular distance to any other statement. We construct a reference set of five rating responses, each corresponding to an integer on the Likert scale. Then, the response likelihood of any integer is set to be proportional to the cosine similarity between the synthetic response vector and the corresponding reference response vector.
  • Figure 2: Comparison of real and synthetic surveys based on GPT-4o with $T_\mathrm{LLM}=0.5$. (A) Mean purchase intent comparison for (A.i) direct Likert ratings (DLRs), (A.ii) textual elicitation with follow-up Likert ratings (FLRs), and (A.iii) semantic similarity ratings (SSRs). (B) Eight example survey response distributions for real surveys and the corresponding synthetic surveys based on DLR, FLR, and SSR, respectively.
  • Figure 3: Comparison of purchase intent distribution similarity between real and synthetic surveys based on GPT-4o with $T_\mathrm{LLM}=0.5$ for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs) and semantic similarity ratings (SSRs).
  • Figure 4: Mean purchase intent stratified by five demographic and product features (shown are results from the SSR method for both GPT-4o and Gem-2f). Error bars represent standard errors.
  • Figure 5: A surrogate product concept similar to those used in the 57 concept surveys.
  • ...and 30 more figures
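
The mapping described in the Figure 1(B) caption can be sketched as follows. This is an illustrative sketch only: the 2-D vectors stand in for real embedding vectors, the names `ssr_distribution` and `mean_rating` are hypothetical, and clipping negative similarities to zero is an added assumption so that the result is a valid probability distribution. It follows the caption's "proportional to cosine similarity" rule and is not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ssr_distribution(response_vec, reference_vecs):
    """Likert distribution with likelihoods proportional to cosine similarity.

    reference_vecs[i] is the embedding of the anchor statement for rating i + 1.
    Assumption: negative similarities are clipped to zero before normalizing.
    """
    sims = [max(cosine_similarity(response_vec, ref), 0.0) for ref in reference_vecs]
    total = sum(sims)
    if total == 0.0:
        # No anchor is similar at all: fall back to a uniform distribution.
        return [1.0 / len(sims)] * len(sims)
    return [s / total for s in sims]


def mean_rating(pmf):
    """Expected Likert rating under the distribution (ratings start at 1)."""
    return sum((i + 1) * p for i, p in enumerate(pmf))


# Toy 2-D "embeddings": five anchors spread between 0 and 90 degrees,
# and a response vector lying closest to the anchor for rating 4.
angles = [0.0, 22.5, 45.0, 67.5, 90.0]
references = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]
response = (math.cos(math.radians(70.0)), math.sin(math.radians(70.0)))

pmf = ssr_distribution(response, references)
print([round(p, 3) for p in pmf], round(mean_rating(pmf), 2))
```

In practice the response and anchor vectors would come from an embedding model, and each synthetic respondent contributes a full distribution over the five ratings rather than a single integer.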