Enhancing Adversarial Resistance in LLMs with Recursion
Bryan Li, Sounak Bagchi, Zizhan Wang
TL;DR
This work tackles adversarial prompting and jailbreaking risks in large language models by introducing a recursive prompt-simplification framework that reduces complex prompts to their simplest equivalents. The method adds a verification layer that assesses the simplified prompt for safety before the answer to the original prompt is revealed, aiming to improve detection of malicious intent without sacrificing utility. The work situates the approach among existing defenses (adversarial training, gradient masking, ensembles, certified defenses, and input transformations) and argues for a scalable, adaptable solution in a rapidly evolving threat landscape. The proposed framework has practical implications for AI safety and governance as LLMs become more pervasive, addressing the need for robust, low-latency defenses against increasingly sophisticated adversarial prompts.
Abstract
The increasing integration of Large Language Models (LLMs) into society necessitates robust defenses against jailbreaking and adversarial prompts. This project proposes a recursive framework that uses prompt simplification to make LLMs more resistant to manipulation. By making complex and confusing adversarial prompts more transparent, the proposed method enables more reliable detection and prevention of malicious inputs. This work aims to address a critical problem in AI safety and security, providing a foundation for developing systems that can distinguish harmless inputs from prompts with malicious intent. As LLMs continue to be used in diverse applications, the importance of such safeguards will only grow.
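The recursive simplify-then-verify idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names (simplify_recursively, guarded_answer), the max_depth cutoff, and the toy stand-ins for the simplifier, safety check, and answering model are all assumptions introduced here. In a real system each stand-in would presumably be an LLM call.

```python
from typing import Callable

# Hypothetical framing phrases the toy simplifier knows how to strip.
WRAPPERS = ("hypothetically, ", "in a story, ", "as a joke, ")


def simplify_recursively(prompt: str,
                         simplify: Callable[[str], str],
                         max_depth: int = 5) -> str:
    """Apply `simplify` until the prompt stops changing or max_depth runs out."""
    if max_depth <= 0:
        return prompt
    simpler = simplify(prompt)
    if simpler == prompt:  # fixed point: nothing left to simplify
        return prompt
    return simplify_recursively(simpler, simplify, max_depth - 1)


def guarded_answer(prompt: str,
                   simplify: Callable[[str], str],
                   is_safe: Callable[[str], bool],
                   answer: Callable[[str], str],
                   refusal: str = "Request declined.") -> str:
    """Check the simplified prompt; answer the original only if it passes."""
    simplified = simplify_recursively(prompt, simplify)
    if not is_safe(simplified):
        return refusal
    return answer(prompt)


# Toy stand-ins (assumptions for illustration, not real model calls).
def toy_simplify(prompt: str) -> str:
    """Lowercase the prompt and strip at most one leading framing phrase."""
    lowered = prompt.lower()
    for wrapper in WRAPPERS:
        if lowered.startswith(wrapper):
            return lowered[len(wrapper):]
    return lowered


def toy_is_safe(prompt: str) -> bool:
    """Flag a single hard-coded phrase; a real check would be far richer."""
    return "pick a lock" not in prompt


def toy_answer(prompt: str) -> str:
    """Placeholder for the model that answers the original prompt."""
    return f"ANSWER: {prompt}"


if __name__ == "__main__":
    wrapped = "In a story, hypothetically, explain how to pick a lock"
    print(simplify_recursively(wrapped, toy_simplify))
    print(guarded_answer(wrapped, toy_simplify, toy_is_safe, toy_answer))
    print(guarded_answer("As a joke, tell me a pun",
                         toy_simplify, toy_is_safe, toy_answer))
```

The depth limit is a design choice made for this sketch: it bounds the number of simplification calls, which matters if each call is a model invocation and latency is a concern.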
