Non-Halting Queries: Exploiting Fixed Points in LLMs

Ghaith Hammouri; Kemal Derya; Berk Sunar


TL;DR

The paper identifies a previously unreported vulnerability in autoregressive transformers: carefully crafted queries can drive the model into a non-halting state in which it never samples the <eos> token and therefore never terminates. It provides a formal framework for cyclic-anomalies and proves that, at temperature $\tau=0$, a token cycle that repeats in the output beyond the context window implies the model never halts; experiments on base and aligned models support the analysis. The authors develop a practical attack recipe that carries fixed points observed in unaligned base models over to aligned models. In an experiment with 100 randomly selected tokens, the recipe's success rate ranges from 97% for GPT-4o to 19% for Gemini Pro 1.5. They further explore gradient-based inversion with ARCA and find that three input tokens can be enough to induce a non-halting state: 1,512 of 10,000 inverted three-token prompts (about 15%) are non-halting on llama-3.1-8b-instruct. The work concludes with potential mitigations, emphasizing hard limits, loop detection, and sampler-level safeguards as defenses against such adversarial prompts.
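The hard-limit and loop-detection mitigations mentioned above can be sketched as a small wrapper around a decoding loop. This is only an illustrative sketch, not the authors' implementation: `find_cycle`, `generate_with_guard`, and the `next_token` callback are hypothetical names, and the `window` and `max_cycle` thresholds are assumptions. A window shorter than the model's context size makes the check a heuristic that may also stop legitimately repetitive outputs.

```python
from typing import Callable, List, Optional, Tuple


def find_cycle(tokens: List[int], window: int, max_cycle: int) -> Optional[int]:
    """Return the smallest period c <= max_cycle such that the last `window`
    tokens are periodic with period c, or None if no such period exists."""
    if window < 2 or len(tokens) < window:
        return None
    tail = tokens[-window:]
    # Only accept periods that repeat at least twice inside the window.
    for c in range(1, min(max_cycle, window // 2) + 1):
        if all(tail[i] == tail[i - c] for i in range(c, window)):
            return c
    return None


def generate_with_guard(
    next_token: Callable[[List[int]], int],  # hypothetical model callback
    prompt: List[int],
    eos: int,
    window: int,
    max_cycle: int,
    hard_limit: int,
) -> Tuple[List[int], str]:
    """Decode until <eos>, a detected loop, or the hard token limit."""
    out: List[int] = []
    while len(out) < hard_limit:
        tok = next_token(prompt + out)
        if tok == eos:
            return out, "eos"
        out.append(tok)
        if find_cycle(out, window, max_cycle) is not None:
            return out, "loop"
    return out, "limit"
```

The returned reason string distinguishes a normal stop (`"eos"`) from the two safeguards (`"loop"` and `"limit"`), so a caller could log or rate-limit queries that trigger them.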

Abstract

We introduce a new vulnerability that exploits fixed points in autoregressive models and use it to craft queries that never halt. More precisely, for non-halting queries, the LLM never samples the end-of-string token <eos>. We rigorously analyze the conditions under which the non-halting anomaly presents itself. In particular, at temperature zero, we prove that if a repeating (cyclic) token sequence is observed at the output beyond the context size, then the LLM does not halt. We demonstrate non-halting queries in many experiments performed in base unaligned models where repeating prompts immediately lead to a non-halting cyclic behavior as predicted by the analysis. Further, we develop a simple recipe that takes the same fixed points observed in the base model and creates a prompt structure to target aligned models. We demonstrate the recipe's success in sending every major model released over the past year into a non-halting state with the same simple prompt even over higher temperatures. Further, we devise an experiment with 100 randomly selected tokens and show that the recipe to create non-halting queries succeeds with high success rates ranging from 97% for GPT-4o to 19% for Gemini Pro 1.5. These results show that the proposed adversarial recipe succeeds in bypassing alignment at one to two orders of magnitude higher rates compared to earlier reports. We also study gradient-based direct inversion using ARCA to craft new short prompts to induce the non-halting state. We inverted 10,000 random repeating 2-cycle outputs for llama-3.1-8b-instruct. Out of 10,000 three-token inverted prompts 1,512 yield non-halting queries reaching a rate of 15%. Our experiments with ARCA show that non-halting may be easily induced with as few as 3 input tokens with high probability. Overall, our experiments demonstrate that non-halting queries are prevalent and relatively easy to find.
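The temperature-zero claim rests on determinism: with greedy decoding, the next token is a pure function of the context window, so once a full window recurs the output repeats forever and <eos> is never reached. The toy sketch below illustrates that reasoning only. `toy_greedy_next` is a made-up stand-in for a real model, and `EOS` and `W` are assumed values, not taken from the paper.

```python
from typing import List, Tuple

EOS = 0  # hypothetical end-of-sequence token id
W = 4    # hypothetical context window size


def toy_greedy_next(window: Tuple[int, ...]) -> int:
    """Toy stand-in for argmax decoding: a pure function of the window.
    Emits EOS if the last two tokens match, else copies the second-to-last."""
    if len(window) < 2:
        return window[-1]
    if window[-1] == window[-2]:
        return EOS
    return window[-2]


def decode(prompt: List[int], max_steps: int) -> Tuple[List[int], str]:
    """Greedy decoding that stops on EOS, on a repeated window, or at max_steps."""
    seq = list(prompt)
    seen = set()
    for _ in range(max_steps):
        window = tuple(seq[-W:])
        if window in seen:
            # Same window, same deterministic successor: the run is periodic
            # from here on, and no EOS occurred inside the period.
            return seq[len(prompt):], "non-halting"
        seen.add(window)
        tok = toy_greedy_next(window)
        if tok == EOS:
            return seq[len(prompt):], "halted"
        seq.append(tok)
    return seq[len(prompt):], "undecided"


print(decode([1, 2], 50))  # the 2-cycle 1, 2 fills the window and recurs
print(decode([3, 3], 50))  # the toy model emits EOS immediately
```

At temperatures above zero the successor is sampled rather than fixed, so this argument no longer applies directly; the paper treats that case separately.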


Paper Structure (25 sections, 7 theorems, 10 equations, 11 figures, 5 tables)


Introduction
Threat Model
Related Work
Formal Analysis
Definitions
Non-Halting Anomalies
What happens at higher temperatures?
Attack Validation
Attack Rationale and Recipe
Experiments on llama 3
Base Model Meta-Llama-3-8B
Meta-Llama-3-8B-Instruct
Experiments on gpt4-o
Experiments on gemma-2
Attack Validation on Major LLMs
...and 10 more sections

Key Result

Proposition 1

Let $q$ be a $(b,c,\ell)$ cyclic-anomaly for model $\overline{\mathsf{M}}_{w}$ at temperature $\tau=0$, then $q$ is a $(b,c,\ell')$ cyclic-anomaly for model $\overline{\mathsf{M}}_{w}$ at temperature $\tau=0$ where $\ell'\in\mathbb{Z}[b+c+1,\ell]$.
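Proposition 1 can be read as a prefix property. The sketch below assumes, since the formal definition is not reproduced on this page, that a $(b,c,\ell)$ cyclic-anomaly means the first $\ell$ output tokens consist of $b$ leading tokens followed by a pattern with period $c$. Under that assumed reading, shortening $\ell$ to any $\ell'\geq b+c+1$ only removes checks, so the property is preserved. `is_cyclic` is a hypothetical helper, not notation from the paper.

```python
from typing import Sequence


def is_cyclic(output: Sequence[int], b: int, c: int, ell: int) -> bool:
    """Assumed reading of a (b, c, ell) cyclic output: within the first `ell`
    tokens, every token from position b + c onward equals the token c
    positions earlier. Requires ell >= b + c + 1 so at least one repeat is seen."""
    if b < 0 or c < 1 or ell < b + c + 1 or ell > len(output):
        return False
    return all(output[i] == output[i - c] for i in range(b + c, ell))


# One leading token (9), then the 3-cycle 4, 5, 6 repeated.
output = [9, 4, 5, 6, 4, 5, 6, 4, 5]
b, c, ell = 1, 3, len(output)

# If the property holds at ell, it holds for every ell' in [b + c + 1, ell].
print(is_cyclic(output, b, c, ell))
print(all(is_cyclic(output, b, c, l) for l in range(b + c + 1, ell + 1)))
```

The converse direction does not follow from this check: a pattern that holds up to $\ell'$ may break at a larger length, which is why the proposition only moves downward from $\ell$.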

Figures (11)

Figure 1: gpt-4o-2024-05-13 non-halting example at temperature $0$.
Figure 2: Fictitious function $f$ built out of $\mathsf{T}_w$ Transformer blocks with $w$ token inputs. $f$ is formed by unrolling $\overline{\mathsf{M}}_{w}$ iterations $c=3$ times (the cycle length). The function is obtained by replicating the Transformers with shifted inputs $c$ times to cover a full cycle. Once the context window fills with repetitions of the cycle, if every output is the token that should come next in the sequence, the model is stuck. For $\tau=0$, any fixed point $x$ of $f$ such that $f(x_1,x_2,x_3)=(x_1,x_2,x_3)$ gives us a non-halting anomaly.
Figure 3: Non-Halting Example in the Base Model Meta-Llama-3-8B
Figure 4: Two Token Non-Halting Example in Meta-Llama-3-8B-Instruct
Figure 5: Non-Halting Example with the single-token cycle Adam in gpt4-o at temperature $0$
...and 6 more figures

Theorems & Definitions (19)

Definition 1
Definition 2
Definition 3
Definition 4
Definition 5: Non-Halting Cyclic-Anomaly
Proposition 1
proof
Proposition 2
proof
Lemma 1
...and 9 more
