Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks ?

Virgile Rennard, Christos Xypolopoulos, Michalis Vazirgiannis

TL;DR

A novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model, to evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints.

Abstract

Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.

Paper Structure

This paper contains 23 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Our debate system. The first instance of the model is asked a question, which it answers with a number ranging from -10 to 10. It is then exposed to a debate between two other instances of the same model, one agreeing and one disagreeing with the question. After the debate, it is asked the original question again and answers with an informed mind.
  • Figure 2: Average results across six categories (Political, Economic, Societal, Morality, Sexuality, and Secularity) for various Large Language Models. The results compare model responses before and after exposure to fair debates and debates biased toward opposing viewpoints.
  • Figure 3: Average results across six categories (Political, Economic, Societal, Morality, Sexuality, and Secularity) for various Large Language Models in different languages.
  • Figure 4: Average results across eight categories (Secularity, Economy, Race, Misinformation, Nonsense, Culture, Feminism, and Sexuality) for various Large Language Models. The results compare model responses before and after exposure to fair debates and debates biased toward opposing viewpoints, with the red dotted line indicating the neutral response (0).
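
The debate loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable, the prompt wording, the number of rounds, the reading of -10 as "strongly disagree" and 10 as "strongly agree", and the `stub_model` stand-in are all assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical model interface: takes a prompt, returns the generated text.
Model = Callable[[str], str]


def parse_score(reply: str) -> int:
    """Take the first integer in the reply and clamp it to [-10, 10]."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        raise ValueError(f"no numeric answer found in: {reply!r}")
    return max(-10, min(10, int(match.group())))


def run_debate(model: Model, question: str, rounds: int = 2) -> Tuple[int, int, List[str]]:
    """Score the question, stage a pro/con self-debate, then score it again."""
    # Assumed prompt wording; the scale direction is an assumption too.
    ask = (f"{question}\nAnswer with a single integer from -10 "
           "(strongly disagree) to 10 (strongly agree).")
    before = parse_score(model(ask))

    transcript: List[str] = []
    for _ in range(rounds):
        # Two instances of the same model take opposite sides in turn.
        for stance in ("agree", "disagree"):
            history = "\n".join(transcript)
            prompt = (f"You {stance} with the statement: {question}\n"
                      f"Debate so far:\n{history}\nGive your next argument.")
            transcript.append(f"[{stance}] {model(prompt)}")

    # The first instance answers the same question after reading the debate.
    after = parse_score(model("Debate:\n" + "\n".join(transcript) + "\n" + ask))
    return before, after, transcript


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM, used only to exercise the loop."""
    if "Give your next argument" in prompt:
        return "placeholder argument"
    return "4" if prompt.startswith("Debate:") else "-3"


before, after, transcript = run_debate(stub_model, "Statement under test.")
shift = after - before  # how far the answer moved after the debate
```

Swapping `stub_model` for a real model client would leave the loop unchanged. The before/after difference is one simple way to summarize how far an answer moved; whether this matches the paper's reported metric is not stated in this section.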