Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada, Yui Furukawa, Kyosuke Bunji

Abstract

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR) that biases questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.

Paper Structure (66 sections, 18 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: Proposed framework to quantify and compare the socially desirable responding (SDR) of LLMs.
  • Figure 2: Mean $\pm 95\%$ CI Big Five profiles for two representative high-capacity LLMs (GPT-5 and Gemini 2.5 Pro) under honest vs. fake-good instructions. For each model, trait estimates are shown separately for Likert responses (left) and graded forced-choice (GFC) responses (right).
  • Figure 3: SDR effects on Big Five trait estimates (direction-corrected Cohen's $\tilde{d}_z$ computed on latent $\hat{\theta}$). Positive values indicate shifts toward the socially desirable direction (higher A, C, E, and O; lower N). The Likert panel shows consistently large, positive SDR across models and traits, whereas the GFC panel shows substantially attenuated (often near-zero) effects.
  • Figure 4: SDR-recovery trade-off across response formats. For each LLM and format, we plot the summary SDR shift (direction-corrected Cohen's $\tilde{d}_z$) against ground-truth recovery (Pearson $r$ between true persona scores $z$ and IRT-estimated $\hat{\theta}$) under honest responding, both aggregated across traits. Grey lines connect formats within a model. Background colors indicate practical interpretation zones. For SDR, $|\tilde{d}_z|\le 0.2$ (Cohen's (1988) "small" effect) is treated as a practically negligible shift (recommended), $0.2<|\tilde{d}_z|\le 0.5$ (up to "medium") as caution but potentially acceptable, and $|\tilde{d}_z|>0.5$ as avoid; these cutoffs are adopted from the equivalence-testing literature on equivalence bounds (smallest effect size of interest, SESOI) for assessing practically negligible differences (Lakens, Scheel, & Isager, 2018). For recovery, following the convergent-validity literature, we interpret $r\ge 0.70$ as strong (recommended), $0.50\le r<0.70$ as acceptable/moderate, and $r<0.50$ as insufficient (Abma et al., 2016).
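
The quantities in the Figure 3 and Figure 4 captions can be illustrated with a minimal Python sketch. It assumes the standard paired-samples form of Cohen's $d_z$ (mean of the per-persona differences in $\hat{\theta}$ divided by their standard deviation) and applies the sign convention stated in the Figure 3 caption. The excerpt above does not give the paper's exact estimator, so this may differ from it in detail. The function names `direction_corrected_dz`, `sdr_zone`, and `recovery_zone` are hypothetical and introduced here only for illustration.

```python
from statistics import mean, stdev

# Desirable direction per Big Five trait, as stated in the Figure 3 caption:
# higher A, C, E, and O; lower N.
DESIRABLE_SIGN = {"A": 1, "C": 1, "E": 1, "O": 1, "N": -1}


def direction_corrected_dz(theta_honest, theta_fake, trait):
    """Paired Cohen's d_z, signed so that positive = shift toward desirability.

    Assumption: d_z = mean(diff) / sd(diff) with diff = fake-good minus honest,
    computed over the same personas in both conditions.
    """
    if len(theta_honest) != len(theta_fake):
        raise ValueError("paired scores must have equal length")
    diffs = [f - h for h, f in zip(theta_honest, theta_fake)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("d_z is undefined when all differences are identical")
    return DESIRABLE_SIGN[trait] * mean(diffs) / sd


def sdr_zone(dz):
    """Map |d_z| to the interpretation zones given in the Figure 4 caption."""
    size = abs(dz)
    if size <= 0.2:
        return "recommended"
    if size <= 0.5:
        return "caution"
    return "avoid"


def recovery_zone(r):
    """Map Pearson r to the recovery zones given in the Figure 4 caption."""
    if r >= 0.70:
        return "strong"
    if r >= 0.50:
        return "acceptable"
    return "insufficient"


if __name__ == "__main__":
    # Made-up latent scores for four personas, for illustration only.
    honest = [0.0, 1.0, 2.0, 3.0]
    fake_good = [1.0, 1.5, 3.5, 4.0]
    dz = direction_corrected_dz(honest, fake_good, "E")
    print(sdr_zone(dz), recovery_zone(0.62))
```

With this convention, the same raw upward shift in $\hat{\theta}$ counts as positive SDR for A, C, E, and O but as negative SDR for N, which is what allows effects to be summarized across traits on a single axis.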