Table of Contents
Fetching ...

Enhancing Adversarial Resistance in LLMs with Recursion

Bryan Li, Sounak Bagchi, Zizhan Wang

TL;DR

This work tackles adversarial prompting and jailbreaking risks in large language models by introducing a recursive prompt-simplification framework that translates complex prompts into simplest equivalents. The method adds a verification layer that assesses a safe, simplified prompt before revealing the original answer, aiming to improve detection of malicious intent without sacrificing utility. It situates the approach among existing defenses (adversarial training, gradient masking, ensembles, certified defenses, and input transformations) and argues for a scalable, adaptable solution in a rapidly evolving threat landscape. The proposed framework has practical implications for AI safety and governance as LLMs become more pervasive, addressing the need for robust, low-latency defenses against increasingly sophisticated adversarial prompts.

Abstract

The increasing integration of Large Language Models (LLMs) into society necessitates robust defenses against vulnerabilities from jailbreaking and adversarial prompts. This project proposes a recursive framework for enhancing the resistance of LLMs to manipulation through the use of prompt simplification techniques. By increasing the transparency of complex and confusing adversarial prompts, the proposed method enables more reliable detection and prevention of malicious inputs. Our findings attempt to address a critical problem in AI safety and security, providing a foundation for the development of systems able to distinguish harmless inputs from prompts containing malicious intent. As LLMs continue to be used in diverse applications, the importance of such safeguards will only grow.

Enhancing Adversarial Resistance in LLMs with Recursion

TL;DR

This work tackles adversarial prompting and jailbreaking risks in large language models by introducing a recursive prompt-simplification framework that translates complex prompts into simplest equivalents. The method adds a verification layer that assesses a safe, simplified prompt before revealing the original answer, aiming to improve detection of malicious intent without sacrificing utility. It situates the approach among existing defenses (adversarial training, gradient masking, ensembles, certified defenses, and input transformations) and argues for a scalable, adaptable solution in a rapidly evolving threat landscape. The proposed framework has practical implications for AI safety and governance as LLMs become more pervasive, addressing the need for robust, low-latency defenses against increasingly sophisticated adversarial prompts.

Abstract

The increasing integration of Large Language Models (LLMs) into society necessitates robust defenses against vulnerabilities from jailbreaking and adversarial prompts. This project proposes a recursive framework for enhancing the resistance of LLMs to manipulation through the use of prompt simplification techniques. By increasing the transparency of complex and confusing adversarial prompts, the proposed method enables more reliable detection and prevention of malicious inputs. Our findings attempt to address a critical problem in AI safety and security, providing a foundation for the development of systems able to distinguish harmless inputs from prompts containing malicious intent. As LLMs continue to be used in diverse applications, the importance of such safeguards will only grow.

Paper Structure

This paper contains 21 sections, 5 figures.

Figures (5)

  • Figure 1: The Neural Network Architecture.
  • Figure 2: Taken from Attention is all you need, the paper introducing Transformers
  • Figure 3: A system for detecting fraud in healthcare systems.
  • Figure 4: Compute Trends Across Three Eras of Machine Learning
  • Figure 5: Flow Chart for Our Recursive Framework