Probing the Subtle Ideological Manipulation of Large Language Models

Demetris Paschalides, George Pallis, Marios D. Dikaiakos

TL;DR

This work investigates whether large language models can be subtly steered across a nuanced spectrum of political ideologies beyond binary classifications. It introduces a multi-task, five-position ideological dataset and a two-stage per-position fine-tuning framework using LoRA PEFT on Phi-2, Mistral, and Llama-3, followed by comprehensive evaluation across three ideological tasks. The results show that fine-tuning significantly improves nuanced ideological alignment, while explicit prompts yield only marginal gains, underscoring model susceptibility to subtle manipulation. The study highlights safety concerns and calls for robust safeguards, offering datasets, models, and code to support future research on mitigating ideological manipulation in LLMs.

Abstract

Large Language Models (LLMs) have transformed natural language processing, but concerns have emerged about their susceptibility to ideological manipulation, particularly in politically sensitive areas. Prior work has focused on binary Left-Right LLM biases, using explicit prompts and fine-tuning on political QA datasets. In this work, we move beyond this binary approach to explore the extent to which LLMs can be influenced across a spectrum of political ideologies, from Progressive-Left to Conservative-Right. We introduce a novel multi-task dataset designed to reflect diverse ideological positions through tasks such as ideological QA, statement ranking, manifesto cloze completion, and Congress bill comprehension. By fine-tuning three LLMs (Phi-2, Mistral, and Llama-3) on this dataset, we evaluate their capacity to adopt and express these nuanced ideologies. Our findings indicate that fine-tuning significantly enhances nuanced ideological alignment, while explicit prompts provide only minor refinements. This highlights the models' susceptibility to subtle ideological manipulation, suggesting a need for more robust safeguards to mitigate these risks.

Paper Structure

This paper contains 53 sections, 2 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Methodology for evaluating LLM ideological alignment. We construct a multi-task dataset spanning five positions: Progressive-Left (PL), Left-Wing (LW), Center (C), Right-Wing (RW), and Conservative-Right (CR). A base model $m$ is fine-tuned separately for each position ($m_{PL}$–$m_{CR}$), and each resulting model is evaluated on: i) Statement Ranking Agreement; ii) Political Positioning Tests; and iii) Congress Bill Voting Simulation, in each case both with and without explicit prompts.
  • Figure 2: Training task examples for ideological fine-tuning.
  • Figure 3: Ideology score ranges, and the placement of 447 politicians by their ideology and leadership/influence scores, which range from 0 (least influential) to 1 (most influential).
  • Figure 4: Average ideological contradiction scores.
  • Figure 5: Average $\rho$ coefficients between ideological statement rankings using Phi-2 across conditions. Color intensity shows correlation strength as positive or negative. Significance: * (p-value $<$ 0.05), ** (p-value $<$ 0.01), and *** (p-value $<$ 0.001).
  • ...and 5 more figures
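
The Figure 5 caption reports $\rho$ coefficients between statement rankings together with significance stars. As a rough illustration only, the sketch below assumes $\rho$ is Spearman's rank correlation (the caption does not say which coefficient is used) and that the rankings contain no ties. The function names `spearman_rho` and `significance_stars` are hypothetical and are not taken from the paper's code; the star thresholds are the ones stated in the caption.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items.

    Assumption: each list holds the rank (1..n) that one model or condition
    assigns to the same n statements, in the same item order.
    """
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    # Sum of squared rank differences, then the closed-form tie-free formula.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def significance_stars(p_value):
    """Map a p-value to the star notation used in the Figure 5 caption."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


if __name__ == "__main__":
    # Made-up rankings for illustration, not data from the paper.
    model_ranking = [1, 2, 3, 4, 5]
    reference_ranking = [2, 1, 3, 5, 4]
    print(spearman_rho(model_ranking, reference_ranking))
    print(significance_stars(0.03))
```

The p-value itself would come from a statistical test, which is not implemented here; the sketch only shows how a correlation and its star label relate.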