Table of Contents
Fetching ...

Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong

TL;DR

This paper introduces a logic-based framework to study subversion of rule-following in LLMs by modeling rule-following as propositional Horn entailment and defining MMS (Monotone, Maximal, Sound). It shows an equivalence to Horn-SAT, constructs a theoretically exact reasoner using attention-based encodings, and derives theory-driven attacks that transfer to learned models. Empirical experiments on Minecraft crafting tasks with GPT-2 and Llama families reveal that suffix-based jailbreaks align with the theory and produce characteristic attention patterns. The results provide a formal foundation for analyzing jailbreaks and rule-based reasoning in LLMs and highlight directions for stronger safeguards and future expressivity in logical inference.

Abstract

We study how to subvert large language models (LLMs) from following prompt-specified rules. We first formalize rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if $P$ and $Q$, then $R$" for some propositions $P$, $Q$, and $R$. Next, we prove that although small transformers can faithfully follow such rules, maliciously crafted prompts can still mislead both theoretical constructions and models learned from data. Furthermore, we demonstrate that popular attack algorithms on LLMs find adversarial prompts and induce attention patterns that align with our theory. Our novel logic-based framework provides a foundation for studying LLMs in rule-based settings, enabling a formal analysis of tasks like logical reasoning and jailbreak attacks.

Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

TL;DR

This paper introduces a logic-based framework to study subversion of rule-following in LLMs by modeling rule-following as propositional Horn entailment and defining MMS (Monotone, Maximal, Sound). It shows an equivalence to Horn-SAT, constructs a theoretically exact reasoner using attention-based encodings, and derives theory-driven attacks that transfer to learned models. Empirical experiments on Minecraft crafting tasks with GPT-2 and Llama families reveal that suffix-based jailbreaks align with the theory and produce characteristic attention patterns. The results provide a formal foundation for analyzing jailbreaks and rule-based reasoning in LLMs and highlight directions for stronger safeguards and future expressivity in logical inference.

Abstract

We study how to subvert large language models (LLMs) from following prompt-specified rules. We first formalize rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if and , then " for some propositions , , and . Next, we prove that although small transformers can faithfully follow such rules, maliciously crafted prompts can still mislead both theoretical constructions and models learned from data. Furthermore, we demonstrate that popular attack algorithms on LLMs find adversarial prompts and induce attention patterns that align with our theory. Our novel logic-based framework provides a foundation for studying LLMs in rule-based settings, enabling a formal analysis of tasks like logical reasoning and jailbreak attacks.
Paper Structure (24 sections, 14 theorems, 43 equations, 12 figures, 5 tables)

This paper contains 24 sections, 14 theorems, 43 equations, 12 figures, 5 tables.

Key Result

Theorem 2.3

The sequence of proof states $s_0, s_1, \ldots, s_T$ is MMS with respect to the rules $\Gamma$ and known facts $\Phi$ iff they are generated by $T$ steps of $\mathsf{Apply}[\Gamma]$ given $(\Gamma, \Phi)$.

Figures (12)

  • Figure 1: The language model is supposed to deny user queries about building bombs. We consider three models: a theoretical model that reasons over a custom binary-valued encoding of prompts, a learned model trained on these binary-valued prompts, and a standard LLM. (Left) Suffix-based jailbreaks devised against the theoretical constructions transfer to learned reasoners. (Right) Popular jailbreaks use tokens and induce attention patterns predicted by our simple theoretical setup.
  • Figure 2: Using example \ref{['eqn:forward_chaining']}: attacks against the three inference properties (Definition \ref{['def:mms']}) given a model $\mathcal{R}$ and input $X_0 = \mathsf{Encode}(\Gamma, \Phi)$ for rules $\Gamma = \{A \to B, A \to C, D \to E, C \land E \to F\}$ and facts $\Phi = \{A,D\}$. The monotonicity attack causes $A$ to be forgotten. The maximality attack causes the rule $D \to E$ to be suppressed. The soundness attack induces an arbitrary sequence.
  • Figure 3: A GCG-generated adversarial suffix suppresses the rule "If I have Wool, then I can create String", causing the LLM to omit String and Fishing Rod from its generation. This is the expected behavior of rule suppression: the targeted rule and its dependents are absent from the output. Note that the GCG-generated suffix of tokens will often resemble gibberish.
  • Figure 4: The suppressed rule receives less attention in the attacked case than in the non-attacked case. We show the difference between the attention weights of the attacked (with suffix) and the non-attacked (without suffix) generations, with appropriate padding applied. The attacked generation places less attention on the red positions and greater attention on the blue positions. The detailed prompts and generations are given in \ref{['fig:mc_suppression_example_full_details']} in the Appendix.
  • Figure 5: Linear probing on LLMs gives evidence for binary-valued theoretical analyses. Deeper probes have better accuracies (left) and F1 scores (right). The F1 score is computed with respect to all the probe coordinates (left), and it is lower when there are more propositions to recover. (Right) When an adversarial suffix is present, the probes struggle to recover the non-attacked (original) state; instead, the probes tend to recover what the attacker is attempting to inject, i.e., the adversarial state.
  • ...and 7 more figures

Theorems & Definitions (25)

  • Definition 2.2: Monotone, Maximal, and Sound (MMS)
  • Theorem 2.3
  • Theorem A.3
  • proof
  • Lemma A.4
  • proof
  • Lemma A.5
  • Lemma A.6
  • proof
  • Definition B.1: Monotone, Maximal, and Sound (MMS)
  • ...and 15 more