Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd

TL;DR

Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents jailbreaking and model hijacking of Large Language Models by design, is presented and evaluated.

Abstract

Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. When interacting with a chatbot, malicious users can input specially crafted prompts that cause the LLM to generate undesirable content or perform a different task from its intended purpose. Existing systems attempt to mitigate this by hardening the LLM's system prompt or using additional classifiers to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. We present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user's question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user's question and extracts ("highlights") relevant passages from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe and implement several possible instantiations of H&S and evaluate their responses in terms of correctness, relevance, and quality. For certain question-answering (QA) tasks, the responses produced by H&S are judged to be as good as, if not better than, those of a standard RAG pipeline.
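
The two-component split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword-overlap highlighter, the stopword list, the prompt wording, and the names `highlight`, `summarize`, `hs_pipeline`, and `llm` are all assumptions made for illustration. The only property the sketch aims to show is that the user's question reaches the highlighter alone, while the generative model receives nothing but verbatim passages from the retrieved documents.

```python
import re
from typing import Callable, List, Set

# Hypothetical stopword list, chosen only to keep the toy highlighter selective.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "it", "of", "the", "to", "what"}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def highlight(question: str, documents: List[str], min_overlap: int = 2) -> List[str]:
    """Toy extractive highlighter (an assumption, not the paper's method).

    Returns sentences copied verbatim from the documents that share at least
    `min_overlap` content words with the question.
    """
    q_words = content_words(question)
    selected = []
    for doc in documents:
        for sentence in split_sentences(doc):
            if len(q_words & content_words(sentence)) >= min_overlap:
                selected.append(sentence)
    return selected


def summarize(passages: List[str], llm: Callable[[str], str]) -> str:
    """Build the summarizer prompt from the highlighted passages only.

    The user's question is deliberately not a parameter of this function.
    """
    prompt = "Summarize the following passages into a cohesive answer:\n"
    prompt += "\n".join(f"- {p}" for p in passages)
    return llm(prompt)


def hs_pipeline(question: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Highlight first, then summarize; `llm` is any text-in, text-out callable."""
    passages = highlight(question, documents)
    if not passages:
        return "No relevant passages were found."
    return summarize(passages, llm)
```

Here `llm` stands in for whatever generative model a deployment would use. Because `summarize` has no question parameter, instructions embedded in the question cannot be forwarded to the model by accident; what the model sees is determined entirely by which document sentences the highlighter selects.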

Paper Structure

This paper contains 59 sections, 4 theorems, 7 equations, 10 figures, 10 tables.

Key Result

Lemma 5.1

The size of the control region for an LLM is bounded by the size of the input space:

Figures (10)

  • Figure 1: H&S-enhanced RAG pipeline. A malicious user asking a question to this system cannot jailbreak the LLM, by design.
  • Figure 2: Adaptive highlighting attack example.
  • Figure 3: Wins of each pipeline in the direct pairwise comparison by the ComparisonJudge. Ties are omitted.
  • Figure 4: Number of times each pipeline had the highest rating, including ties, according to LLM-as-Judges.
  • Figure 5: Evaluation of responses via LLM-as-Judges on the two datasets, measuring correctness, relevance, and quality.
  • ...and 5 more figures

Theorems & Definitions (7)

  • Definition 5.1: LLM Control Region
  • Lemma 5.1: Bounded control region
  • Theorem 5.1: Exponential decrease in control
  • Lemma A.1: Bounded control region
  • proof
  • Theorem A.1: Exponential decrease in control
  • proof