Measuring Self-Rating Bias in LLM-Generated Survey Data: A Semantic Similarity Framework for Independent Scale Mapping
Eduardo Vera Pichardo
TL;DR
We calibrate and validate Semantic Similarity Rating (SSR), which decouples generation from scale mapping via embedding-based cosine similarity against predefined anchor statements.
Abstract
Synthetic survey data generated by large language models (LLMs) suffers from a fundamental circularity: the same model family that generates text responses also maps them to numerical scales. We calibrate and validate Semantic Similarity Rating (SSR; Maier et al., 2024), which decouples generation from scale mapping via embedding-based cosine similarity against predefined anchor statements. Configuration experiments (N=17 pilot, N=69 cross-validation across 8 domains) show that naturalistic behavioral anchors outperform formal jargon by 29 percentage points (pp), and that SSR achieves 65–67% exact match and 91% within ±1. A cross-model test with OpenAI text-embedding-3-small reaches 77% exact match, confirming cross-provider generalization. Direct LLM baselines (Claude 87%, GPT-4o 83%) establish that SSR's contribution is methodological independence, not accuracy superiority. A control condition that removes the question text from the LLM prompt actually improves LLM accuracy, ruling out information asymmetry as the explanation for SSR's lower accuracy. A pre-registered circularity experiment (N=345) reveals roughly 4× lower error variance in LLM ratings than in SSR (σ² = 0.21 vs. 0.87) and systematic directional bias. A cross-model control (GPT-4o rating Claude-generated text) shows nearly identical compression (within/cross ratio = 0.93), indicating that variance compression is a general LLM property rather than a within-model artifact. The calibration dataset, anchor library, and source code are publicly available (see Data Availability).
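To make the scale-mapping step concrete, the following minimal sketch illustrates the general idea of rating a response by cosine similarity against anchor embeddings, and of scoring agreement as exact match and within ±1. It is an illustration under stated assumptions, not the released implementation: the function names, the toy two-dimensional vectors, and the nearest-anchor (argmax) decision rule are all hypothetical, and the actual SSR mapping rule may differ (for example, it may use a distribution over scale points rather than a single nearest anchor). In practice the vectors would come from an embedding model rather than being written by hand.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest_anchor_rating(
    response_vec: Sequence[float],
    anchor_vecs: Dict[int, Sequence[float]],
) -> int:
    """Return the scale point whose anchor embedding is most similar.

    Assumption: a simple argmax rule, used here only for illustration.
    """
    return max(
        anchor_vecs,
        key=lambda point: cosine_similarity(response_vec, anchor_vecs[point]),
    )


def agreement(
    predicted: Sequence[int], reference: Sequence[int]
) -> Tuple[float, float]:
    """Return (exact-match rate, within-±1 rate) for paired ratings."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predicted)
    exact = sum(p == r for p, r in zip(predicted, reference)) / n
    within_one = sum(abs(p - r) <= 1 for p, r in zip(predicted, reference)) / n
    return exact, within_one


# Toy anchors for a 5-point scale (hypothetical 2-D "embeddings").
toy_anchors = {
    1: [1.0, 0.0],
    2: [0.9, 0.4],
    3: [0.7, 0.7],
    4: [0.4, 0.9],
    5: [0.0, 1.0],
}

toy_rating = nearest_anchor_rating([0.5, 0.8], toy_anchors)
toy_exact, toy_within_one = agreement([4, 3, 2, 5], [4, 4, 2, 3])
```

Because the rating is computed from vector geometry alone, the model that generated the response text plays no part in assigning its scale value, which is the independence property the abstract emphasizes.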
