Measuring Self-Rating Bias in LLM-Generated Survey Data: A Semantic Similarity Framework for Independent Scale Mapping

Eduardo Vera Pichardo

TL;DR

Calibrate and validate Semantic Similarity Rating (SSR), which decouples generation from scale mapping via embedding-based cosine similarity against predefined anchor statements.

Abstract

Synthetic survey data generated by large language models (LLMs) suffers from a fundamental circularity: the same model family that generates text responses also maps them to numerical scales. We calibrate and validate Semantic Similarity Rating (SSR; Maier et al., 2024), which decouples generation from scale mapping via embedding-based cosine similarity against predefined anchor statements. Configuration experiments (N=17 pilot, N=69 cross-validation across 8 domains) show that naturalistic behavioral anchors outperform formal jargon by 29 percentage points (pp), and that SSR achieves 65-67% exact match and 91% within $\pm$1; a cross-model test with OpenAI text-embedding-3-small reaches 77% exact, confirming cross-provider generalization. Direct LLM baselines (Claude 87%, GPT-4o 83%) establish that SSR's contribution is methodological independence, not accuracy superiority. A control condition removing question text from the LLM prompt actually improves LLM accuracy, ruling out information asymmetry as the explanation for SSR's lower accuracy. A pre-registered circularity experiment (N=345) reveals 4x compressed error variance in LLM rating ($\sigma^2 = 0.21$ vs. $0.87$ for SSR) and systematic directional bias. A cross-model control (GPT-4o rating Claude-generated text) shows nearly identical compression (within/cross ratio = 0.93), indicating variance compression is a general LLM property rather than a within-model artifact. The calibration dataset, anchor library, and source code are publicly available (see Data Availability).
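
The scale-mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vectors are toy 3-dimensional stand-ins for real embeddings, the helper names `cosine` and `ssr_rating` are hypothetical, and the decision rule (pick the anchor with the highest cosine similarity) is an assumption, since this excerpt does not state the exact rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ssr_rating(response_vec, anchor_vecs):
    """Map a response embedding to a scale point.

    anchor_vecs maps each rating (e.g. 1..5) to the embedding of its
    anchor statement. Assumed rule: return the rating whose anchor is
    most similar to the response, plus all similarities for inspection.
    """
    sims = {rating: cosine(response_vec, vec) for rating, vec in anchor_vecs.items()}
    best = max(sims, key=sims.get)
    return best, sims


# Toy embeddings (hypothetical values, for illustration only).
anchors = {
    1: [1.0, 0.0, 0.0],
    2: [0.8, 0.6, 0.0],
    3: [0.0, 1.0, 0.0],
    4: [0.0, 0.6, 0.8],
    5: [0.0, 0.0, 1.0],
}
response = [0.1, 0.9, 0.2]

rating, sims = ssr_rating(response, anchors)
print(rating, {k: round(v, 3) for k, v in sims.items()})
```

In a real pipeline the response and anchor vectors would come from an embedding model that is separate from the generation model, which is the source of the independence the abstract describes.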

Paper Structure (91 sections, 5 equations, 8 figures, 21 tables)

Figures (8)

  • Figure 1: SSR pipeline architecture. The generation model (Claude Haiku 4.5) produces text responses; the embedding model independently maps them to scale ratings via cosine similarity. The primary configuration uses Voyage 3.5-lite (1024 dims); cross-model validation with OpenAI text-embedding-3-small (1536 dims) is reported in Section \ref{sec:cross-model}. The two models are architecturally independent, substantially reducing circular validation.
  • Figure 2: Cosine similarity compression illustrated on test case LIK-3 (expected rating = 3). (a) Raw cosine similarities between the response text and all 5 anchor statements span only 0.083 (from 0.812 to 0.895), making the correct anchor difficult to identify. (b) After min-max normalization, the same similarities are stretched to $[0, 1]$: the correct anchor (Rating 3) reaches 1.00, clearly separated from the next-highest (Rating 1, 0.31).
  • Figure 3: Accuracy progression across the three calibration experiments. Naturalistic anchors account for the largest improvement (+29 pp), followed by asymmetric embedding (+6 pp). Light bars indicate within $\pm$1 accuracy.
  • Figure 4: Similarity heatmaps for symmetric (left) vs. asymmetric (right) embedding. Each cell shows the cosine similarity between a test response and the 5 anchor statements. Asymmetric embedding produces wider similarity spread, improving discrimination between adjacent scale points.
  • Figure 5: Per-domain exact match accuracy on the expanded 69-case test set. Three performance tiers are visible: strong (satisfaction, purchase intent), moderate (agreement, value, likelihood), and weak (ease, trust, importance).
  • ...and 3 more figures
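
The min-max normalization described in the Figure 2 caption can be sketched as follows. Only three facts come from the caption: the raw similarities range from 0.812 to 0.895, Rating 3 holds the maximum, and Rating 1 normalizes to about 0.31. The remaining raw values, the choice of which anchor holds the minimum, and the function name `min_max` are hypothetical and chosen only for illustration.

```python
def min_max(sims):
    """Rescale a {rating: similarity} mapping linearly onto [0, 1]."""
    lo, hi = min(sims.values()), max(sims.values())
    if hi == lo:
        # All similarities equal: no spread to stretch.
        return {rating: 0.0 for rating in sims}
    return {rating: (value - lo) / (hi - lo) for rating, value in sims.items()}


# Raw cosine similarities for one response against 5 anchors.
# 0.812 (min) and 0.895 (max, Rating 3) follow the caption; 0.838 for
# Rating 1 is back-computed from the reported 0.31; the values for
# Ratings 2, 4, and 5 are made up.
raw = {1: 0.838, 2: 0.820, 3: 0.895, 4: 0.825, 5: 0.812}

norm = min_max(raw)
print({rating: round(value, 2) for rating, value in norm.items()})
```

The raw spread of 0.083 is stretched to the full unit interval, so the gap between the best and second-best anchor grows from under 0.06 in raw similarity to roughly 0.69 after normalization.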