How Susceptible are Large Language Models to Ideological Manipulation?

Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman

TL;DR

The paper investigates how instruction tuning can instill ideological biases in LLMs and how such biases generalize across topics. It introduces IdeoINST, a dataset of ~6k instructions, each paired with a left-leaning and a right-leaning response, which is used to finetune models. A model's ideology is quantified by a bias score $S\in[-1,1]$ (negative values indicate a left-leaning bias), validated by multiple evaluators. Findings show that even ~100 biased instruction–response pairs can substantially shift a model's ideology, and that the shift generalizes to topics other than the one used for finetuning. GPT-3.5 is more susceptible than Llama-2-7B, and larger models show greater vulnerability. The work highlights significant safety concerns and the need for safeguards against ideologically poisoned training data and annotation biases. It provides a methodological framework for measuring and mitigating ideological manipulation in LLMs.

Abstract

Large Language Models (LLMs) possess the potential to exert substantial influence on public perceptions and interactions with information. This raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. In this work, we investigate how effectively LLMs can learn and generalize ideological biases from their instruction-tuning data. Our findings reveal a concerning vulnerability: exposure to only a small amount of ideologically driven samples significantly alters the ideology of LLMs. Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it to even unrelated ones. The ease with which LLMs' ideologies can be skewed underscores the risks associated with intentionally poisoned training data by malicious actors or inadvertently introduced biases by data annotators. It also emphasizes the imperative for robust safeguards to mitigate the influence of ideological manipulations on LLMs.

Paper Structure (34 sections, 11 figures, 10 tables)

Figures (11)

  • Figure 1: An example of ideological manipulation of LLMs. (a) The vanilla LLM initially holds a left-leaning ideology on Guns. (b) The vanilla LLM is finetuned on right-leaning instruction-response pairs on a different topic, Immigration, shifting its ideology on Immigration rightwards. (c) The manipulated LLM's ideology on Guns is also shifted rightwards, indicating the generalizability of the manipulation.
  • Figure 2: The data curation pipeline of IdeoINST, illustrated on the topic of Crime and Guns. (a) Instruction generation and filtering. The instruction pool is seeded with a few questions from the OpinionQA survey (Santurkar et al., 2023). At each step, random instructions are sampled from the pool and used as in-context examples to prompt the LLM to generate more instructions. Generated instructions that are dissimilar to the ones in the pool are kept and added to the pool. (b) Partisan response generation. For each instruction in the pool, an LLM is prompted to generate open-ended left-leaning and right-leaning responses to it.
  • Figure 3: Ideological bias scores of four vanilla (unmanipulated) LLMs across six topics. Darker blue and more negative values indicate a stronger left-leaning bias.
  • Figure 4: Ideological bias shift of the manipulated Llama-2-7B and GPT-3.5 across six topics (as indicated by different columns). Each row represents the topic and the leaning the model was manipulated on. The color indicates the extent of the ideological changes, with blue for leftward shifts and red for rightward shifts.
  • Figure 5: Ideological manipulation evaluation using the political compass test. "Gender/Left" indicates the model (Llama-2 or GPT-3.5) finetuned on left-leaning instruction-response pairs on Gender & Sexuality.
  • ...and 6 more figures
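
The pool-growing loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `respond` are hypothetical callables standing in for LLM calls, the similarity measure (`difflib.SequenceMatcher`) and the `0.7` threshold are placeholders chosen for the example, and the caption does not specify how dissimilarity is actually measured or how many in-context examples are sampled.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional


def is_dissimilar(candidate: str, pool: List[str], threshold: float = 0.7) -> bool:
    """True if the candidate is not too close to any instruction already in the pool.

    The similarity measure and threshold are assumptions made for this sketch.
    """
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < threshold
        for existing in pool
    )


def grow_pool(
    seeds: List[str],
    generate: Callable[[List[str]], List[str]],  # hypothetical LLM wrapper
    target_size: int,
    num_examples: int = 3,
    max_steps: int = 100,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Step (a): sample in-context examples, generate candidates, keep the novel ones."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(max_steps):
        if len(pool) >= target_size:
            break
        examples = rng.sample(pool, min(num_examples, len(pool)))
        for candidate in generate(examples):
            if len(pool) < target_size and is_dissimilar(candidate, pool):
                pool.append(candidate)
    return pool


def partisan_responses(
    instruction: str,
    respond: Callable[[str, str], str],  # hypothetical LLM wrapper: (instruction, leaning) -> text
) -> Dict[str, str]:
    """Step (b): request one left-leaning and one right-leaning response per instruction."""
    return {leaning: respond(instruction, leaning) for leaning in ("left", "right")}
```

In use, `generate` would wrap a prompt that shows the sampled instructions and asks for new ones on the same topic, and `respond` would wrap a prompt that asks for an open-ended answer from the given leaning. Keeping both as injected callables lets the filtering logic be tested without any model access.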