Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Valeriia Cherepanova, James Zou

TL;DR

The paper investigates whether large language models can be manipulated by adversarial gibberish inputs (LM Babel) to produce predefined coherent texts. It uses the Greedy Coordinate Gradient (GCG) optimizer to craft 20-token Babel prompts and analyzes their effectiveness across datasets and models, linking success to target text length and perplexity. Babel prompts often sit in lower loss minima than natural prompts and show discernible structure in model representations, yet they are highly sensitive to token-level perturbations. Eliciting harmful text is no harder, and in some cases easier, than eliciting benign text, which signals alignment gaps for out-of-distribution prompts. The work highlights safety concerns, offers a starting point for defenses by identifying brittle prompt structures, and contributes to understanding how LLMs process non-human languages and adversarial inputs. Its practical value lies in informing safety mechanisms and research on robustness and interpretability.

Abstract

Large language models (LLMs) exhibit an excellent ability to understand human languages, but do they also understand their own language that appears gibberish to us? In this work we delve into this question, aiming to uncover the mechanisms underlying such behavior in LLMs. We employ the Greedy Coordinate Gradient optimizer to craft prompts that compel LLMs to generate coherent responses from seemingly nonsensical inputs. We call these inputs LM Babel, and this work systematically studies the behavior of LLMs manipulated by these prompts. We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima compared to natural prompts. We further examine the structure of the Babel prompts and evaluate their robustness. Notably, we find that guiding the model to generate harmful texts is no more difficult than guiding it to generate benign texts, suggesting a lack of alignment for out-of-distribution prompts.
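
Since the abstract ties manipulation efficiency to the target text's perplexity, it may help to recall how perplexity is usually computed. The following is a minimal sketch that assumes the standard definition (the exponential of the mean negative log-likelihood per token); the function name and the example log-probabilities are hypothetical, and the paper's exact evaluation setup is not described in this summary.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for a sequence of tokens.

    token_logprobs: natural-log probabilities a model assigned to each
    token of the target text (hypothetical values in this sketch).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical target texts: one the model finds predictable, one it does not.
predictable = [math.log(0.5)] * 4   # each token assigned probability 0.5
surprising = [math.log(0.1)] * 4    # each token assigned probability 0.1

print(perplexity(predictable))  # roughly 2
print(perplexity(surprising))   # roughly 10
```

Under this definition, a target text whose tokens the model finds less likely has higher perplexity, which is the quantity the abstract relates to lower success rates.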

Paper Structure

This paper contains 30 sections, 2 equations, 14 figures, 9 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Example of an LM Babel prompt that led to a coherent response from LLaMA2-Chat-7B. Removing a few tokens from the Babel prompt renders it unrecognizable to the LLM.
  • Figure 2: Success rate of the Babel prompts versus target text length. The plot illustrates that constructing gibberish prompts for generating longer target texts becomes increasingly challenging.
  • Figure 3: Success rate of the Babel prompts versus target perplexity at the dataset level. The plot illustrates that the success rate is lower for datasets with higher average target text perplexity.
  • Figure 4: UMAP of the last hidden state representations of LLaMA2-7B for successful Babel, natural, and random prompts constructed for all four datasets.
  • Figure 5: Histogram of the number of shared tokens in the target text and Babel prompts.
  • ...and 9 more figures