Non-Halting Queries: Exploiting Fixed Points in LLMs
Ghaith Hammouri, Kemal Derya, Berk Sunar
TL;DR
The paper identifies a previously unreported vulnerability in autoregressive transformers: carefully crafted queries can drive the model into a non-halting state in which it never samples the <eos> token, so generation does not terminate on its own. It provides a formal framework for cyclic anomalies and proves that, at $\tau=0$, a token cycle that keeps repeating in the output beyond the context window implies the model never halts, with empirical validation across base and aligned models. The authors develop a practical attack recipe that transfers fixed points from unaligned base models to aligned models and report high success rates across major LLMs, including 97% on GPT-4o in an experiment with 100 randomly selected tokens. They further explore gradient-based inversion with ARCA and observe that a non-halting state can be induced with as few as three input tokens. The work concludes with potential mitigations, emphasizing hard output limits, loop detection, and sampler-level safeguards against such adversarial prompts in real-world deployments.
Abstract
We introduce a new vulnerability that exploits fixed points in autoregressive models and use it to craft queries that never halt. More precisely, for non-halting queries, the LLM never samples the end-of-string token <eos>. We rigorously analyze the conditions under which the non-halting anomaly presents itself. In particular, at temperature zero, we prove that if a repeating (cyclic) token sequence is observed at the output beyond the context size, then the LLM does not halt. We demonstrate non-halting queries in many experiments performed on unaligned base models, where repeating prompts immediately lead to non-halting cyclic behavior as predicted by the analysis. Further, we develop a simple recipe that takes the same fixed points observed in the base model and creates a prompt structure to target aligned models. We demonstrate the recipe's success in sending every major model released over the past year into a non-halting state with the same simple prompt, even at higher temperatures. We also devise an experiment with 100 randomly selected tokens and show that the recipe for creating non-halting queries succeeds with high success rates, ranging from 97% for GPT-4o to 19% for Gemini Pro 1.5. These results show that the proposed adversarial recipe bypasses alignment at rates one to two orders of magnitude higher than earlier reports. We also study gradient-based direct inversion using ARCA to craft new short prompts that induce the non-halting state. We inverted 10,000 random repeating 2-cycle outputs for llama-3.1-8b-instruct. Of the 10,000 three-token inverted prompts, 1,512 yield non-halting queries, a rate of about 15%. Our experiments with ARCA show that non-halting may be easily induced with as few as 3 input tokens with high probability. Overall, our experiments demonstrate that non-halting queries are prevalent and relatively easy to find.
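The temperature-zero claim above can be illustrated with a small sketch. It is not the authors' code or their exact theorem statement. It assumes a hypothetical deterministic model `next_token` that sees only the last `context_size` tokens, a hypothetical `EOS` id, and a toy rule `toy_model` invented for this example. The idea: if the last `context_size + p` output tokens are `p`-periodic, the window that will produce the next token is identical to the window that produced the token `p` steps earlier, so greedy decoding must emit the same non-`<eos>` token again, and by induction the cycle never ends.

```python
from typing import Callable, List, Optional, Sequence, Tuple

EOS = -1  # hypothetical end-of-sequence token id (assumption for this sketch)


def greedy_generate(next_token: Callable[[Tuple[int, ...]], int],
                    prompt: Sequence[int],
                    context_size: int,
                    max_new_tokens: int) -> List[int]:
    """Greedy (temperature-zero) decoding with a sliding context window."""
    tokens = list(prompt)
    output: List[int] = []
    for _ in range(max_new_tokens):
        window = tuple(tokens[-context_size:])  # model sees only this window
        tok = next_token(window)
        if tok == EOS:
            break
        tokens.append(tok)
        output.append(tok)
    return output


def find_cycle_filling_context(output: Sequence[int],
                               context_size: int,
                               max_period: int) -> Optional[int]:
    """Return the smallest period p (1..max_period) such that the last
    context_size + p output tokens are p-periodic, or None if there is none.

    When such a p exists, the current window equals the window seen p steps
    ago, so a deterministic model repeats the cycle and never emits EOS.
    """
    for p in range(1, max_period + 1):
        span = context_size + p
        if len(output) < span:
            break  # longer periods need even more output
        tail = output[-span:]
        if all(tail[i] == tail[i + p] for i in range(context_size)):
            return p
    return None


def toy_model(window: Tuple[int, ...]) -> int:
    """Invented toy rule: halt if token 0 is in view, else alternate 7 and 8."""
    if 0 in window:
        return EOS
    return 8 if window[-1] == 7 else 7


out = greedy_generate(toy_model, prompt=[1, 2, 3], context_size=4, max_new_tokens=20)
print(find_cycle_filling_context(out, context_size=4, max_period=3))  # prints 2
```

The same periodicity check could serve as a simple loop-detection safeguard alongside a hard cap such as `max_new_tokens`. At nonzero temperature the argument no longer gives a guarantee, so a detector of this kind would only be a heuristic there.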
