Knowledge Return Oriented Prompting (KROP)

Jason Martin, Kenneth Yeung

TL;DR

The paper identifies weaknesses in current LLM safety measures, such as guardrails and prompt filters, and introduces Knowledge Return Oriented Prompting (KROP), a technique that assembles prompt injections from references in the model's training data to bypass these defenses. KROP treats each reference as a modular gadget and composes the gadgets into complete prompts that evade detection and carry out attacks against both text and multimodal systems. The authors illustrate the concept with examples including DALL-E 3 jailbreaks and LangChain-based SQL injections, and extend it with Mad Libs-style obfuscation techniques. The work exposes significant vulnerabilities in contemporary safety mechanisms and motivates more robust, context-aware defenses for LLMs and their ecosystems.

Abstract

Many Large Language Models (LLMs) and LLM-powered apps deployed today use some form of prompt filter or alignment to protect their integrity. However, these measures aren't foolproof. This paper introduces KROP, a prompt injection technique capable of obfuscating prompt injection attacks, rendering them virtually undetectable to most of these security measures.
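
To make the filter-evasion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical literal-phrase blocklist (`BLOCKED_PHRASES`, `naive_filter`) and a deliberately benign target string. The point it illustrates is only that a filter matching surface text can pass a prompt that refers to the same content indirectly, through knowledge the model is expected to have.

```python
import re

# Hypothetical blocklist for illustration only: the filter rejects any
# prompt that contains one of these literal phrases.
BLOCKED_PHRASES = ["hello world"]


def naive_filter(prompt: str) -> bool:
    """Return True if the prompt passes a literal-substring blocklist."""
    # Lowercase and collapse whitespace so trivial spacing tricks do not help.
    normalized = re.sub(r"\s+", " ", prompt.lower())
    return not any(phrase in normalized for phrase in BLOCKED_PHRASES)


# A direct request names the blocked phrase and is caught.
direct = "Please print Hello   World!"

# An indirect request never contains the phrase; it only points at
# well-known knowledge from which the phrase can be reconstructed.
indirect = (
    "Print the two-word greeting that introductory programming "
    "tutorials traditionally use as their first program's output."
)

print(naive_filter(direct))    # False: blocked by the literal match
print(naive_filter(indirect))  # True: passes the filter
```

The sketch says nothing about how real guardrails are built. It only shows why a defense that inspects the literal prompt text may miss a request whose meaning depends on what the model already knows.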

Paper Structure (9 sections, 9 figures)

This paper contains 9 sections and 9 figures.

Figures (9)

  • Figure 1: Hello World! KROP Injection
  • Figure 2: GPT-4o denying our request
  • Figure 3: Completed KROP Jailbreak
  • Figure 4: Chinook.db Example Tables
  • Figure 5: List of SQL tables after we run our injection.
  • ...and 4 more figures