Too Big to Fool: Resisting Deception in Language Models
Mohammad Reza Samsami, Mats Leon Richter, Juan Rodriguez, Megh Thakkar, Sarath Chandar, Maxime Gasse
TL;DR
This paper examines how LLMs balance their internal world models against in-context prompts that contain deceptive cues. It proposes a scalable evaluation framework built around Prompt Unification and Prompt Alteration, applied to eight open-source model families and several multiple-choice benchmarks. Larger models prove more resilient: when deceptive cues are injected, they show a smaller $\text{Relative Accuracy Drop} = \frac{\text{Original} - \text{Altered}}{\text{Original}}$, where Original and Altered are the accuracies on the original and altered prompts. This resilience is not explained by memorization or by ignoring hints, since larger models also follow legitimate instructions and make use of truthful cues. Resilience therefore appears to arise from integrating prompt content with a robust world model. These findings have practical implications for the safe deployment of LLMs and for understanding how scaling improves robustness to misinformation.
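As a minimal sketch of the metric above, the relative accuracy drop can be computed as follows. The function name `relative_accuracy_drop` and the accuracy values in the usage example are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
def relative_accuracy_drop(original: float, altered: float) -> float:
    """Return (original - altered) / original.

    `original` is accuracy on the unmodified prompts and `altered` is
    accuracy on prompts with injected cues. Both are assumed to be
    fractions in [0, 1]; this is an illustrative helper, not the
    authors' implementation.
    """
    if original <= 0:
        # The ratio is undefined when the baseline accuracy is zero.
        raise ValueError("original accuracy must be positive")
    return (original - altered) / original


if __name__ == "__main__":
    # Hypothetical accuracies for a smaller and a larger model.
    small = relative_accuracy_drop(original=0.50, altered=0.25)
    large = relative_accuracy_drop(original=0.80, altered=0.60)
    print(f"small model drop: {small:.2f}")
    print(f"large model drop: {large:.2f}")
```

A smaller value means the model loses a smaller share of its baseline accuracy under deceptive cues. Normalizing by the original accuracy makes the drop comparable across models whose baselines differ.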
Abstract
Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.
