SoK: Prompt Hacking of Large Language Models

Baha Rababah, Shang Wu, Matthew Kwiatkowski, Carson Leung, Cuneyt Gurcan Akcora

TL;DR

The paper proposes a framework that categorizes LLM responses into five distinct classes instead of the traditional binary classification. The finer-grained classes give more granular insight into the AI's behavior, improve diagnostic precision, and enable more targeted enhancements to the system's safety and robustness.

Abstract

The safety and robustness of applications based on large language models (LLMs) remain critical challenges in artificial intelligence. Among the key threats to these applications are prompt hacking attacks, which can significantly undermine the security and reliability of LLM-based systems. In this work, we offer a comprehensive and systematic overview of three distinct types of prompt hacking (jailbreaking, leaking, and injection) and address the nuances that differentiate them despite their overlapping characteristics. To enhance the evaluation of LLM-based applications, we propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification. This approach provides more granular insights into the AI's behavior, improving diagnostic precision and enabling more targeted enhancements to the system's safety and robustness.

Paper Structure

This paper contains 21 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Illustration of prompt-level and token-level jailbreaks.
  • Figure 2: Illustration of prompt jailbreak approaches.
  • Figure 3: Illustration of direct prompt injection.
  • Figure 4: Illustration of prompt leaking.