Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

Shaked Zychlinski; Yuval Kainan

TL;DR

The paper tackles safety gaps in large language models where encoded or ciphered prompts bypass guardrails. It introduces CPT-Filtering, a low-cost, model-agnostic defense that uses the average characters per token produced by Byte-Pair Encoding tokenizers to detect out-of-distribution, obfuscated text. Across a dataset of roughly 120k prompts and multiple encoding schemes, CPT-Filtering achieves near-perfect separation between original and obfuscated prompts, even for very short inputs, and outperforms perplexity-based guards with negligible compute. The method supports real-time filtering and data curation, demonstrates robustness across several tokenizers, and addresses multilingual and mixed-input scenarios via sliding-window detection. This approach offers a practical, deployable layer of defense against jailbreak attempts without requiring additional models or heavy computation.

Abstract

Large Language Models (LLMs) are susceptible to jailbreak attacks in which malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique relies on a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of modern methods, which rely on added modules such as dedicated LLMs or perplexity models. We validate our approach on a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.
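The CPT idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy vocabulary, the greedy `toy_tokenize` function, and the threshold of 2.0 are hypothetical stand-ins chosen so the example runs without any external tokenizer; a real deployment would use a trained BPE tokenizer and a threshold calibrated on data.

```python
import codecs
from typing import Callable, List, Sequence

# Hypothetical toy vocabulary standing in for a trained BPE vocabulary.
# Pieces carry a leading space, loosely mimicking GPT-style tokenizers.
TOY_VOCAB = {"please", " tell", " me", " how", " to", " make", " a", " cake"}
MAX_PIECE_LEN = max(len(piece) for piece in TOY_VOCAB)


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match over TOY_VOCAB; anything else becomes 1-char tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_PIECE_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
    return tokens


def chars_per_token(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average number of characters per token (CPT) for a non-empty text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("cannot compute CPT for text with no tokens")
    return len(text) / len(tokens)


def is_obfuscated(text: str, tokenize: Callable[[str], Sequence],
                  threshold: float = 2.0) -> bool:
    """Flag text whose CPT falls below the threshold (2.0 is illustrative only)."""
    if not text:
        return False  # nothing to flag in an empty prompt
    return chars_per_token(text, tokenize) < threshold


plain = "please tell me how to make a cake"
ciphered = codecs.encode(plain, "rot13")  # simple character-level cipher

for label, prompt in [("plain", plain), ("rot13", ciphered)]:
    cpt = chars_per_token(prompt, toy_tokenize)
    print(f"{label}: CPT={cpt:.3f} flagged={is_obfuscated(prompt, toy_tokenize)}")
```

Only the token count matters here, so any callable that maps a string to a sequence of tokens or token ids can be passed as `tokenize`, for example the `encode` method of a real BPE tokenizer. In this toy setup the plain prompt splits into a few multi-character pieces, while its ROT13 version falls back to one token per character, which is the gap the CPT threshold exploits.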
