PReSS: A Black-Box Framework for Evaluating Political Stance Stability in LLMs via Argumentative Pressure

Shariar Kabir, Kevin Esterling, Yue Dong

Abstract

Existing evaluations of political bias in large language models (LLMs) typically classify outputs as left- or right-leaning. We extend this perspective by examining how ideological tendencies vary across topics and how consistently models maintain their positions, a property we refer to as stability. To capture this dimension, we propose PReSS (Political Response Stability under Stress), a black-box framework that evaluates LLMs by jointly considering model and topic context, categorizing responses into four stance types: stable-left, unstable-left, stable-right, and unstable-right. Applying PReSS to 12 widely used LLMs across 19 political topics reveals substantial variation in stance stability; for instance, a model that is left-leaning overall can exhibit stable-right behavior on certain topics. This highlights the importance of topic-aware and fine-grained evaluation of political ideologies of LLMs. Moreover, stability has practical implications for controlled generation and model alignment: interventions such as debiasing or ideology reversal should explicitly account for stance stability. Our empirical analyses reveal that when models are prompted or fine-tuned to adopt the opposite ideology, unstable topic stances are more likely to change, whereas stable ones resist modification. Thus, treating stability as a moderating factor provides a principled foundation for understanding, evaluating, and guiding interventions in politically sensitive model behavior.

Paper Structure

This paper contains 29 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of topic-wise stance variability motivating stance-stability analysis. Arrows indicate the direction of each model's overall ideological position and its stance on individual responses. The supporting and counter arguments in the prompts are shown in green and red, respectively. Models can have different stances on a topic even though they share the same overall ideology (Olmo and Gemma). Moreover, their stance stability on the topic can also differ (Mistral).
  • Figure 2: Stances of the candidate models. Left- and right-leaning candidates are shown in different colors. (political compass test, community)
  • Figure 3: Topic-wise stance stability ($S_t$) for Qwen2.5-3B across 19 economic statements.
  • Figure 4: AUROC using different uncertainty metrics with Stable vs. Unstable labels. Semantic entropy achieves the strongest discrimination with an AUROC of up to 0.78.
  • Figure 5: Distribution of Stability-faithful (SF), Stability-unfaithful (SU), and Indeterminate (ID) behaviors under argument variation, grouped by left-leaning and right-leaning models.
  • ...and 3 more figures