Too Big to Fool: Resisting Deception in Language Models

Mohammad Reza Samsami, Mats Leon Richter, Juan Rodriguez, Megh Thakkar, Sarath Chandar, Maxime Gasse

TL;DR

This paper addresses how LLMs balance their internal world models with in-context prompts under deceptive cues, proposing a scalable evaluation framework built around Prompt Unification and Prompt Alteration and applied to eight open-source model families on several multiple-choice benchmarks. It shows that larger models are more resilient, as evidenced by a smaller $\text{Relative Accuracy Drop} = \frac{\text{Original} - \text{Altered}}{\text{Original}}$ when deceptive cues are injected, and that this resilience is not due to memorization or to ignoring hints. The results also demonstrate that larger models can follow legitimate instructions and use truthful cues, indicating that resilience arises from integrating prompt content with a robust world model. These findings have practical implications for the safe deployment of LLMs and for understanding how scaling improves robustness to misinformation.
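As a minimal sketch of the Relative Accuracy Drop metric defined above, the following Python snippet computes it from two accuracy values. The function name and the example accuracies are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_accuracy_drop(original: float, altered: float) -> float:
    """Return (original - altered) / original for two accuracy values.

    `original` is the accuracy on unaltered prompts and `altered` is the
    accuracy after a prompt alteration (for example, an injected deceptive cue).
    """
    if original <= 0:
        raise ValueError("original accuracy must be positive")
    return (original - altered) / original


# Hypothetical accuracies (assumed for illustration, not taken from the paper).
small_model_drop = relative_accuracy_drop(original=0.60, altered=0.30)
large_model_drop = relative_accuracy_drop(original=0.80, altered=0.60)

# A smaller value means the model lost a smaller fraction of its accuracy.
print(f"small model: {small_model_drop:.2f}, large model: {large_model_drop:.2f}")
```

Normalizing by the original accuracy makes the drop comparable across models with different baselines, which is presumably why a relative rather than an absolute difference is reported.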

Abstract

Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.

Paper Structure

This paper contains 22 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Overview of our evaluation methodology. We begin by selecting a multiple-choice benchmark dataset using the Language Model Evaluation Harness framework (eval-harness). Samples are then processed through two methods: Prompt Unification, which standardizes the prompt structure using the MMLU format, and Prompt Alteration, where content is added or removed in the prompt (see Section \ref{sec:prompt-alteration}). Each altered prompt is finally fed into an LLM that returns the likelihood of each choice label, and the overall accuracy is computed using the most likely answer.
  • Figure 2: Relative Accuracy Drop under the Deception. Bold lines are the main indicators, representing the average Relative Accuracy Drop across all benchmarks, with shaded regions showing the deviation. Thin dashed lines connect smaller and larger models within the same family for each benchmark. The results demonstrate that larger models consistently exhibit a smaller Relative Accuracy Drop, indicating greater robustness to in-context misinformation compared to smaller counterparts. Detailed results on individual benchmarks are provided in Appendices \ref{sec:deception-relative-per-bench} and \ref{sec:detailed-result-tables}.
  • Figure 3: Relative Accuracy Drop under the Directive Instruction. Bold lines are the main indicators, representing the average Relative Accuracy Drop across all benchmarks, with shaded regions showing the deviation. Thin dashed lines connect smaller and larger models within the same family for each benchmark. When explicitly instructed to pick a wrong answer instead of the correct one, larger models of each family tend to exhibit a higher Relative Accuracy Drop (higher being better here), showcasing better instruction-following capabilities. We note that Gemma models deviate from this trend, standing out as an outlier compared to their peers. It is worth noting that the Gemma family is also the worst-performing one on most of the original benchmarks, often by a large margin (detailed results are available in Appendices \ref{sec:vis-ins} and \ref{sec:detailed-result-tables}).
  • Figure 4: Accuracy Drop under the Context Removal. Accuracy of each model on the original ($\bullet$) and altered ($\times$) MMLU benchmark, ordered by original performance. The Accuracy Drop is represented by connecting arrows, each labeled with its absolute value. All models except Gemma-2-2B-it maintain performance well above chance (horizontal grey line), indicating an ability to infer task-relevant information from the choice options.
  • Figure 5: Overfitting and Context Removal. Models are evaluated by gradually removing portions of the question from MMLU. A Llama-3.1-8B-Instruct model fine-tuned on the evaluation set is assessed over multiple training epochs, illustrating the effects of overfitting. The DCLM-7B model, which has had no prior exposure to MMLU, exhibits a similar performance decay to the overfitted models and maintains accuracy above chance level despite the question's removal. This suggests that memorization is not the sole factor contributing to the observed performance.
  • ...and 4 more figures
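
The scoring step described in the Figure 1 caption (take the likelihood of each choice label, pick the most likely one, and compute accuracy) can be sketched as follows. This is an illustrative sketch only: the function names and the log-likelihood values are hypothetical, and the paper's actual pipeline relies on the Language Model Evaluation Harness rather than this code.

```python
from typing import Sequence


def predict_choice(loglikelihoods: Sequence[float]) -> int:
    """Return the index of the choice label with the highest log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def accuracy(per_sample_lls: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of samples whose most likely choice matches the gold index."""
    if len(per_sample_lls) != len(gold):
        raise ValueError("need exactly one gold label per sample")
    correct = sum(predict_choice(lls) == g for lls, g in zip(per_sample_lls, gold))
    return correct / len(gold)


# Hypothetical log-likelihoods for choice labels A-D on three samples
# (made-up numbers, not model outputs from the paper).
sample_lls = [
    [-1.2, -0.3, -2.0, -1.9],  # most likely: index 1 (B)
    [-0.5, -1.0, -1.5, -2.0],  # most likely: index 0 (A)
    [-2.0, -1.8, -0.4, -1.1],  # most likely: index 2 (C)
]
gold_labels = [1, 0, 3]

print(f"accuracy = {accuracy(sample_lls, gold_labels):.3f}")
```

Running the same scoring on original and altered prompts yields the two accuracies whose difference the Accuracy Drop and Relative Accuracy Drop figures report.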