Table of Contents
Fetching ...

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, Jordan Boyd-Graber

TL;DR

This work addresses the security risks of prompt hacking in LLMs by launching HackAPrompt, a global competition that crowdsourced over 600K adversarial prompts across three state-of-the-art models. It provides a publicly available dataset and a data-driven taxonomy of 29 exploit techniques, plus cross-model and cross-intent analyses that demonstrate current LLMs remain vulnerable. The findings highlight the limited efficacy of prompt-based defenses and the need for robust, multi-faceted safeguards, while offering practical resources and insights for defenders and researchers. The work has broad practical impact by illuminating systemic vulnerabilities and supplying a rich benchmark for future prompt-hacking research across modalities and real-world applications.

Abstract

Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

TL;DR

This work addresses the security risks of prompt hacking in LLMs by launching HackAPrompt, a global competition that crowdsourced over 600K adversarial prompts across three state-of-the-art models. It provides a publicly available dataset and a data-driven taxonomy of 29 exploit techniques, plus cross-model and cross-intent analyses that demonstrate current LLMs remain vulnerable. The findings highlight the limited efficacy of prompt-based defenses and the need for robust, multi-faceted safeguards, while offering practical resources and insights for defenders and researchers. The work has broad practical impact by illuminating systemic vulnerabilities and supplying a rich benchmark for future prompt-hacking research across modalities and real-world applications.

Abstract

Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.
Paper Structure (101 sections, 1 equation, 25 figures, 2 tables, 1 algorithm)

This paper contains 101 sections, 1 equation, 25 figures, 2 tables, 1 algorithm.

Figures (25)

  • Figure 1: Uses of llms often define the task via a prompt template (top left), which is combined with user input (bottom left). We create a competition to see if user input can overrule the original task instructions and elicit specific target output (right).
  • Figure 2: In the competition playground, competitors select the challenge they would like to try (top left) and the model to evaluate with (upper mid left). They see the challenge description (mid left) as well as the prompt template for the challenge (lower mid left). As they type their input in the 'Your Prompt' section (bottom) and after clicking the Evaluate button (bottom), they see the combined prompt as well as completions and token counts (right).
  • Figure 3: The majority of prompts in the Playground Dataset submitted were for four Challenges (7, 9, 4, and 1) and can be viewed as a proxy for difficulty.
  • Figure 4: Token count (the number of tokens in a submission) spikes throughout the competition with heavy optimization near the deadline. The number of submissions declined slowly over time.
  • Figure 5: A Taxonomical Ontology of Prompt Hacking techniques. Blank lines are hypernyms (i.e., typos are an instance of obfuscation), while grey arrows are meronyms (i.e., Special Case attacks usually contain a Simple Instruction). Purple nodes are not attacks themselves but can be a part of attacks. Red nodes are specific examples.
  • ...and 20 more figures