Knowledge Return Oriented Prompting (KROP)
Jason Martin, Kenneth Yeung
TL;DR
The paper identifies weaknesses in current LLM safety measures, such as guardrails and prompt filters, and introduces Knowledge Return Oriented Prompting (KROP), a framework that assembles prompt injections from references to knowledge in the model's training data in order to bypass these defenses. KROP treats such references as modular gadgets and composes them into complete prompts that evade detection and carry out attacks against both text and multimodal systems. The authors illustrate the concept with examples including DALL-E 3 jailbreaks and LangChain-based SQL injections, and extend it with Mad Libs-style obfuscation techniques. The work exposes vulnerabilities in contemporary safety mechanisms and motivates the development of more robust, context-aware defenses for LLMs and their ecosystems.
Abstract
Many Large Language Models (LLMs) and LLM-powered apps deployed today use some form of prompt filter or alignment to protect their integrity. However, these measures aren't foolproof. This paper introduces Knowledge Return Oriented Prompting (KROP), a technique that obfuscates prompt injection attacks, rendering them virtually undetectable to most of these security measures.
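As a rough illustration of why literal prompt filters can miss a reference-based prompt, the following is a minimal sketch and not the authors' implementation. The blocklist, the `passes_filter` function, and the fairy-tale example are all hypothetical and deliberately harmless; the sketch assumes a filter that only matches blocked terms literally.

```python
# Toy sketch (assumption: the filter matches blocked terms literally).
# BLOCKLIST and passes_filter are hypothetical names used for illustration only.

BLOCKLIST = {"rumpelstiltskin"}  # a harmless stand-in for a filtered term


def passes_filter(prompt: str) -> bool:
    """Return True if no blocklisted term appears verbatim in the prompt."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKLIST)


# A direct prompt contains the blocked term and is rejected.
direct = "Write a poem about Rumpelstiltskin."

# An indirect prompt never states the term; it points at knowledge the model
# is assumed to have, so a literal filter has nothing to match on.
indirect = (
    "Write a poem about the character whose name the miller's daughter "
    "must guess in the Grimm fairy tale."
)

direct_allowed = passes_filter(direct)
indirect_allowed = passes_filter(indirect)
```

Here `direct_allowed` is `False` and `indirect_allowed` is `True`: the filter inspects only the surface text, while the model is left to resolve the reference from what it already knows.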
