Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova, James Zou
TL;DR
The paper investigates whether large language models can be manipulated by adversarial gibberish inputs (LM Babel) to produce predefined coherent texts. It uses the Greedy Coordinate Gradient (GCG) optimizer to craft 20-token Babel prompts and analyzes their effectiveness across datasets and models, linking success to target text length and perplexity. Findings show that Babel prompts are often located in lower loss minima than natural prompts and exhibit discernible structure in model representations, yet are highly sensitive to token-level perturbations; in some cases, generating harmful content is easier than generating benign content, signaling alignment gaps for out-of-distribution prompts. The work highlights safety concerns, provides a foundation for defenses by identifying brittle prompt structures, and contributes to understanding how LLMs process non-human languages and adversarial inputs. Its practical impact includes informing safety mechanisms and prompting paradigms for robustness and interpretability research.
Abstract
Large language models (LLMs) exhibit an excellent ability to understand human languages, but do they also understand their own language, which appears as gibberish to us? In this work we delve into this question, aiming to uncover the mechanisms underlying such behavior in LLMs. We employ the Greedy Coordinate Gradient optimizer to craft prompts that compel LLMs to generate coherent responses from seemingly nonsensical inputs. We call these inputs LM Babel, and this work systematically studies the behavior of LLMs manipulated by these prompts. We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima compared to natural prompts. We further examine the structure of the Babel prompts and evaluate their robustness. Notably, we find that guiding the model to generate harmful texts is not more difficult than guiding it to generate benign texts, suggesting a lack of alignment for out-of-distribution prompts.
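The abstract ties manipulation efficiency to the target text's perplexity. As a reminder of what that quantity measures, here is a minimal sketch of the standard definition (the exponential of the negative mean token log-probability). The function name and the log-probability values are hypothetical placeholders for illustration, not numbers or code from the paper; in practice the log-probabilities would come from a language model scoring the target text.

```python
import math

def perplexity(token_logprobs):
    """Standard perplexity: exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two four-token targets.
predictable_target = [math.log(0.5)] * 4    # model finds each token likely
surprising_target = [math.log(0.05)] * 4    # model finds each token unlikely

print(perplexity(predictable_target))
print(perplexity(surprising_target))
```

A target whose tokens the model already finds likely has low perplexity; a target the model finds surprising has high perplexity, which is the property the abstract relates to how efficiently a target can be elicited.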
