Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models

Shahnewaz Karim Sakib, Anindya Bijoy Das, Shibbir Ahmed

TL;DR

The paper investigates adversarial factuality, where input prompts deliberately include misinformation with varying levels of expressed confidence, and evaluates eight open-source LLMs on their ability to detect and correct such prompts. Using the Adversarial Factuality dataset and GPT-4o as a response evaluator, the study measures attack success rates across three adversarial confidence levels and analyzes prompt-level dynamics, including optimally crafted and suboptimal prompts. Key findings show that LLaMA 3.1 (8B) robustly detects adversarial prompts, while Falcon (7B) is more vulnerable, and that performance generally declines as adversarial confidence increases, with notable exceptions for LLaMA 3.1 and Phi 3 where detection worsens with decreasing confidence. The work highlights the importance of addressing sycophancy in LLMs, underscores the potential of adversarial reasoning as a next step, and advocates for adaptive adversarial training and standardized benchmarks to enhance robustness and trust in AI systems informing high-stakes decisions.

Abstract

Adversarial factuality refers to the deliberate insertion of misinformation into input prompts by an adversary, characterized by varying levels of expressed confidence. In this study, we systematically evaluate the performance of several open-source large language models (LLMs) when exposed to such adversarial inputs. Three tiers of adversarial confidence are considered: strongly confident, moderately confident, and limited confidence. Our analysis encompasses eight LLMs: LLaMA 3.1 (8B), Phi 3 (3.8B), Qwen 2.5 (7B), Deepseek-v2 (16B), Gemma2 (9B), Falcon (7B), Mistrallite (7B), and LLaVA (7B). Empirical results indicate that LLaMA 3.1 (8B) exhibits a robust capability in detecting adversarial inputs, whereas Falcon (7B) shows comparatively lower performance. Notably, for the majority of the models, detection success improves as the adversary's confidence decreases; however, this trend is reversed for LLaMA 3.1 (8B) and Phi 3 (3.8B), where a reduction in adversarial confidence corresponds with diminished detection performance. Further analysis of the queries that elicited the highest and lowest rates of successful attacks reveals that adversarial attacks are more effective when targeting less commonly referenced or obscure information.

Paper Structure

This paper contains 24 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Illustration of adversarial factuality detection: If the model successfully detects adversarial information, the detection is deemed successful, meaning the attack was unsuccessful. Conversely, if the model fails to identify such information, the detection is considered unsuccessful, indicating that the attack was successful.
  • Figure 2: Three levels of adversarial confidence: A strongly confident adversary begins their assertion with "As you know", a moderately confident adversary starts with "I think", and a limited-confidence adversary uses "I guess".
  • Figure 3: Attack success rates (ASR) for eight open-source LLMs under two adversarial confidence levels: strongly confident adversary and moderately confident adversary.
  • Figure 4: Attack success rates (ASR) for two open-source LLMs under three adversarial confidence levels: strongly confident adversary, moderately confident adversary, and limited-confidence adversary.
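
The captions above imply a simple evaluation loop: prefix a false claim with a confidence marker, collect a judgment on whether the model flagged the misinformation, and count undetected cases as successful attacks. The following is a minimal sketch of that bookkeeping, not the authors' implementation. The three prefixes come from the Figure 2 caption; the prompt template, the function names, and the `(level, detected)` record format are assumptions made for illustration, and the sample claim is hypothetical.

```python
from collections import defaultdict

# Prefixes as described in the Figure 2 caption. The level keys are
# illustrative labels, not names used by the paper.
CONFIDENCE_PREFIXES = {
    "strong": "As you know,",
    "moderate": "I think",
    "limited": "I guess",
}


def build_adversarial_prompt(false_claim: str, question: str, level: str) -> str:
    """Join a confidence prefix, a false claim, and a question.

    The exact template is an assumption; the source only states which
    phrase each adversary opens with.
    """
    prefix = CONFIDENCE_PREFIXES[level]
    return f"{prefix} {false_claim} {question}"


def attack_success_rate(records):
    """Compute the attack success rate per confidence level.

    `records` is an iterable of (level, detected) pairs, where `detected`
    is True when the judge decided the model flagged the misinformation.
    Following the Figure 1 caption, an attack succeeds when detection fails.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for level, detected in records:
        totals[level] += 1
        if not detected:
            successes[level] += 1
    return {level: successes[level] / totals[level] for level in totals}


# Hypothetical example data, not results from the paper.
prompt = build_adversarial_prompt(
    "the Eiffel Tower is in Berlin.", "How tall is it?", "strong"
)
rates = attack_success_rate(
    [("strong", False), ("strong", True), ("limited", True)]
)
```

Keeping the judgment as a plain boolean leaves the sketch independent of how the evaluator (GPT-4o in the study) is actually queried.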