Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy

TL;DR

This paper critiques the standard practice of evaluating LLM values and opinions with constrained, multiple-choice surveys, using the Political Compass Test (PCT) as a case study. It shows that results change substantially depending on whether models are forced to select a single option, how the prompt is paraphrased, and whether the prompt is constrained or open-ended, and that the results are often unstable and do not generalize. Based on experiments across ten models and four evaluation settings (unforced multiple-choice, forced multiple-choice, paraphrase, open-ended), the authors argue for context-specific, robust evaluations that mirror real user interactions and caution against global claims about LLM values. They make three practical recommendations: align evaluations with actual use cases, perform extensive robustness checks, and limit claims to local contexts. The work supports safer and more accurate assessment of value representations and biases in LLMs, informing alignment research and policy-relevant evaluations.

Abstract

Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.

Paper Structure (28 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: A model is prompted with a proposition from the Political Compass Test. In the most constrained setting (left), the model is given multiple choices and forced to choose one. In a less constrained setting (middle), the same model gives a different answer. In the more realistic unconstrained setting (bottom), the same model takes a different position again, which is also one discouraged in the constrained settings.
  • Figure 2: (bottom) PCT responses (%) that are valid and invalid for the 10 models described in the experimental setup section. The rows correspond to different "forced choice" prompts for making models give a valid response, detailed in the table (top).
  • Figure 3: Overall PCT results for the two models analysed in the paraphrase invariance section. Each point corresponds to a minimal semantics-preserving prompt template paraphrase (e.g. "What is your opinion on..." $\rightarrow$ "State your opinion on..."). $\Delta$ is the distance between the furthest points for each model. For reference, we show 2020 PCT results for Joe Biden and Donald Trump from https://www.politicalcompass.org/uselection2020.
  • Figure 4: PCT responses (%) that express strong agreement or strong disagreement with each of the 62 PCT propositions. In the open-ended setting, models can only agree or disagree, not express strength. Each bar corresponds to one PCT proposition, labelled with its ID, and 10 different prompt template paraphrases. A mix of responses within a bar indicates paraphrase instability. We list all PCT propositions with their IDs in the appendix. In 1.9% of multiple-choice cases and 8.7% of open-ended cases, GPT-3.5 expresses neither agreement nor disagreement, which we mark in grey. We also highlight in red an example of a proposition which Mistral always agrees with in the multiple-choice setting, but always disagrees with in the open-ended setting.
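
The Figure 3 caption defines $\Delta$ as the distance between the furthest points for each model. A minimal sketch of that computation follows. It assumes Euclidean distance on the two PCT axes, which the caption does not specify, and the coordinates and the function name `max_pairwise_distance` are hypothetical illustrations, not values or code from the paper.

```python
from itertools import combinations
from math import dist


def max_pairwise_distance(points):
    """Return the largest Euclidean distance between any two (x, y) points.

    Assumption: the distance metric is Euclidean; the caption only says
    "distance".
    """
    if len(points) < 2:
        return 0.0
    return max(dist(p, q) for p, q in combinations(points, 2))


# Hypothetical (economic, social) coordinates, one per prompt paraphrase.
example_points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
delta = max_pairwise_distance(example_points)
print(delta)  # 5.0
```

A larger value means the paraphrases of the prompt template move the model further across the compass, which is the instability the figure is meant to show.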