Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

Shaked Zychlinski, Yuval Kainan

TL;DR

The paper tackles safety gaps in large language models where encoded or ciphered prompts bypass guardrails. It introduces CPT-Filtering, a low-cost, model-agnostic defense that uses the average characters per token produced by Byte-Pair Encoding tokenizers to detect out-of-distribution, obfuscated text. Across a dataset of roughly 120k prompts and multiple encoding schemes, CPT-Filtering achieves near-perfect separation between original and obfuscated prompts, even for very short inputs, and outperforms perplexity-based guards with negligible compute. The method supports real-time filtering and data curation, demonstrates robustness across several tokenizers, and addresses multilingual and mixed-input scenarios via sliding-window detection. This approach offers a practical, deployable layer of defense against jailbreak attempts without requiring additional models or heavy computation.

Abstract

Large Language Models (LLMs) are susceptible to jailbreak attacks in which malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of language-model tokenization: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of current methods, which rely on added modules such as dedicated LLMs or perplexity models. We validate our approach on a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.
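The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vocabulary, the greedy tokenizer `toy_encode`, the threshold of 2.5, and the window and stride sizes are all hypothetical values chosen for illustration, and the sliding-window helper is only one possible reading of the mixed-input scenario mentioned in the TL;DR.

```python
from typing import Callable, List, Sequence

# Hypothetical toy vocabulary standing in for a trained BPE vocabulary.
TOY_VOCAB = {"the", " the", " quick", " brown", " fox",
             " jumps", " over", " lazy", " dog"}
MAX_PIECE = max(len(piece) for piece in TOY_VOCAB)

Encoder = Callable[[str], Sequence]


def toy_encode(text: str) -> List[str]:
    """Greedy longest-match tokenizer; unknown text falls back to single characters."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_PIECE, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
    return tokens


def chars_per_token(text: str, encode: Encoder) -> float:
    """Average number of characters per token (CPT) of `text`."""
    n_tokens = len(encode(text))
    # Empty input has no tokens; return infinity so it is never flagged.
    return len(text) / n_tokens if n_tokens else float("inf")


def is_obfuscated(text: str, encode: Encoder, threshold: float = 2.5) -> bool:
    """Flag text whose CPT falls below `threshold` (an assumed value, not from the paper)."""
    return chars_per_token(text, encode) < threshold


def min_window_cpt(text: str, encode: Encoder,
                   window: int = 40, stride: int = 20) -> float:
    """Lowest CPT over character windows, to catch an encoded span inside normal text."""
    if len(text) <= window:
        return chars_per_token(text, encode)
    starts = list(range(0, len(text) - window + 1, stride))
    if starts[-1] != len(text) - window:
        starts.append(len(text) - window)  # make sure the tail is covered
    return min(chars_per_token(text[s:s + window], encode) for s in starts)


def caesar(text: str, shift: int = 3) -> str:
    """Caesar-shift lowercase ASCII letters; leave every other character unchanged."""
    return "".join(
        chr((ord(ch) - 97 + shift) % 26 + 97) if "a" <= ch <= "z" else ch
        for ch in text
    )


plain = "the quick brown fox jumps over the lazy dog"
ciphered = caesar(plain)
mixed = " ".join([plain] * 5 + [ciphered])

print(chars_per_token(plain, toy_encode), is_obfuscated(plain, toy_encode))
print(chars_per_token(ciphered, toy_encode), is_obfuscated(ciphered, toy_encode))
print(chars_per_token(mixed, toy_encode), min_window_cpt(mixed, toy_encode))
```

With the toy tokenizer, the plain sentence is covered by whole-word tokens while its Caesar-shifted version collapses to one character per token. The mixed input passes a whole-text check but is caught by the windowed minimum. In practice `encode` would be a real BPE tokenizer's encode function (for example, `tiktoken.get_encoding("o200k_base").encode`), and the threshold would need to be calibrated on real data.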

Paper Structure

This paper contains 26 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Example of a tokenized representation (using GPT-4o) of an English text (top) and a version of it ciphered with a Caesar cipher (bottom). While both contain the same number of characters (613), the original version consists of 128 tokens, while the ciphered version consists of 294 tokens, more than double.
  • Figure 2: The average number of characters per token for the four models checked. (The y-axis is cut at $0.2$ to better visualize the distributions.)
  • Figure 3: Percentage of prompts marked as "obfuscated-prompts" in each category, as a function of prompt length in tokens.
  • Figure 4: Sum of the number of characters per token of a given prompt, divided by the length (in characters) of the prompt, using the GPT-2 tokenizer. Apart from Arabic and Chinese, all classes are confined to 1.0.
  • Figure 5: Top: the distribution of perplexity for the original and obfuscated prompts. Bottom: the distribution of perplexity and average characters per token for the original and obfuscated prompts.
  • ...and 4 more figures
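
As a worked example, the counts quoted in the Figure 1 caption translate directly into characters-per-token values. The symbols $\mathrm{CPT}_{\text{orig}}$ and $\mathrm{CPT}_{\text{cipher}}$ are labels introduced here for illustration; the quotients are rounded to two decimals.

```latex
\mathrm{CPT} = \frac{\text{number of characters}}{\text{number of tokens}}, \qquad
\mathrm{CPT}_{\text{orig}} = \frac{613}{128} \approx 4.79, \qquad
\mathrm{CPT}_{\text{cipher}} = \frac{613}{294} \approx 2.09
```

For this single example, any threshold between roughly 2.1 and 4.8 would separate the two texts. That range is read off this one caption only and is not a threshold reported by the paper.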