GermanPartiesQA: Benchmarking Commercial Large Language Models and AI Companions for Political Alignment and Sycophancy

Jan Batzner; Volker Stocker; Stefan Schmid; Gjergji Kasneci

TL;DR

GermanPartiesQA provides a ground-truth, QA-based benchmark for evaluating the political alignment of closed-source LLMs, built on Wahl-O-Mat data from multiple German elections. The study combines a factuality test, role-playing with real political personas, and standard alignment scoring. It finds limited factual accuracy, model-specific ideological patterns, and noticeable persona-based steerability. Model outputs prove strongly context-dependent, and the authors caution against equating role-play with sycophancy, proposing persona-based steerability as a clearer lens. The work highlights the need for transparent evaluation interfaces and ecologically valid designs when auditing LLMs embedded in politically sensitive decision-support tools.

Abstract

Large language models (LLMs) are increasingly shaping citizens' information ecosystems. Products incorporating LLMs, such as chatbots and AI Companions, are now widely used for decision support and information retrieval, including in sensitive domains, raising concerns about hidden biases and their growing potential to shape individual decisions and public opinion. This paper introduces GermanPartiesQA, a benchmark of 418 political statements from German Voting Advice Applications across 11 elections, which we use to evaluate six commercial LLMs. We evaluate their political alignment based on role-playing experiments with political personas. Our evaluation reveals three specific findings: (1) Factual limitations: LLMs show limited ability to accurately generate factual party positions, particularly for centrist parties. (2) Model-specific ideological alignment: We identify consistent alignment patterns and the degree of political steerability for each model across temperature settings and experiments. (3) Claim of sycophancy: While models adjust to political personas during role-play, we find this reflects persona-based steerability rather than the increasingly popular, yet contested concept of sycophancy. Our study contributes to evaluating the political alignment of closed-source LLMs that are increasingly embedded in electoral decision support tools and AI Companion chatbots.

Paper Structure (26 sections, 11 figures, 5 tables)

Introduction
Related Work
Sycophancy Evaluations
Standardized Public Opinion Datasets
Political Compass Test
Voting Advice Applications and Our Approach
Data
Voting Advice Applications
GermanPartiesQA
Method
Assessing Political Alignment of LLMs
Role-Playing Based on Political Personas
Test on Factual Political Party Positions
Results
(A) Limited Ability To Generate Factual Party Positions
...and 11 more sections

Figures (11)

Figure 1: Demonstration of Language Model Role-Play for the right-wing parliamentary group co-leader Alice Weidel (AfD). The most popular Character.ai chatbot for Alice Weidel is compared to role-playing with ChatGPT4o, Claude 3 Sonnet, and Command R+. Raw shows the alignment with AfD positions when no persona context is given, Role-Play shows the increase in AfD alignment with persona context.
Figure 2: Limited ability to generate factual party positions: Evaluation of LLMs against ground truth party positions reveals limited accuracy. The heatmap shows that LLMs' ability to reflect self-reported party positions is particularly limited for center parties SPD (social democrats) and CDU-CSU (conservatives). Using our GermanPartiesQA benchmark for political knowledge evaluation, we prompt models with: "Does the party [partyname] respond to the statement [statement] with 'Agree', 'Disagree' or 'Neutral'?".
Figure 3: GermanPartiesQA Model Comparison. The heatmap visualizes the degree of alignment between model outputs and political party positions over 10 iterations with temperature set to 0 for more deterministic outcomes.
Figure 4: Conservatives (CDU-CSU) Example Radar Plot: Role-Playing "I am" and "You are" for the conservative parliamentary group leader. Mean political alignment scores for ChatGPT4o with Temperature 0. As a conservative persona is introduced, model responses align more with conservative and right-wing party positions.
Figure 5: OpenAI ChatGPT4o
...and 6 more figures
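
The factual test described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the prompt wording is taken from the caption, while `build_prompt`, `parse_label`, `accuracy_by_party`, the `answer_fn` callback standing in for a model call, and the toy records are all hypothetical names and placeholder data introduced here. Scoring by per-party accuracy is likewise an assumption.

```python
from collections import Counter

# Prompt wording taken from the Figure 2 caption.
PROMPT_TEMPLATE = (
    "Does the party {party} respond to the statement {statement} "
    "with 'Agree', 'Disagree' or 'Neutral'?"
)


def build_prompt(party, statement):
    """Fill the factual-test prompt for one party and one statement."""
    return PROMPT_TEMPLATE.format(party=party, statement=statement)


def parse_label(reply):
    """Map a free-text reply to one label, or None if no label appears.

    'Disagree' is checked first because 'agree' is a substring of it.
    """
    lowered = reply.lower()
    for label in ("Disagree", "Agree", "Neutral"):
        if label.lower() in lowered:
            return label
    return None


def accuracy_by_party(records, answer_fn):
    """Share of statements per party where the parsed reply matches the
    ground-truth label. `records` holds (party, statement, label) tuples;
    `answer_fn` is a hypothetical stand-in for a call to a model."""
    correct, total = Counter(), Counter()
    for party, statement, truth in records:
        predicted = parse_label(answer_fn(build_prompt(party, statement)))
        total[party] += 1
        correct[party] += int(predicted == truth)
    return {party: correct[party] / total[party] for party in total}


# Toy placeholder records, NOT taken from the benchmark.
toy_records = [
    ("SPD", "statement A", "Agree"),
    ("SPD", "statement B", "Disagree"),
    ("AfD", "statement A", "Agree"),
]

# A stub "model" that always answers 'Agree'.
toy_accuracy = accuracy_by_party(toy_records, lambda prompt: "Agree")
```

Passing the model call in as a function keeps the scoring logic independent of any particular vendor API, so the same loop could be reused across the commercial models being compared.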
