À la recherche du sens perdu: your favourite LLM might have more to say than you can understand

K. O. T. Erziev

TL;DR

The paper investigates a surprising capability of LLMs: understanding and even generating encoded instruction content hidden in symbol sequences, potentially exploiting tokenization artifacts. It proposes a formal framework using encoded UTF-8 patterns to measure understanding and to test jailbreak attempts across a diverse model suite, reporting attack success rates and understanding rates. Key findings show substantial variance across models (e.g., Claude-3.7 Sonnet performing well on understanding, while GPT-4o variants show vulnerability under templates), suggesting tokenization leakage contributes but does not fully explain the phenomenon. The work highlights safety and security implications for LLM-based oversight and argues for interpretability-oriented, non-tokenization approaches, while noting limitations from subset evaluations and possible cherry-picking. Overall, it underscores the need for broader encoding analyses and robust safety paradigms in current and future LLM systems.

Abstract

We report a peculiar observation that LLMs can assign hidden meanings to sequences that seem visually incomprehensible to humans: for example, a nonsensical phrase consisting of Byzantine musical symbols is recognized by gpt-4o as "say abracadabra". Moreover, some models can communicate using these sequences. Some of these meanings are hypothesized to partly originate in the massive spurious correlations due to BPE tokenization. We systematically evaluate the presence of such abilities in a wide range of models: Claude-3.5 Haiku, Claude-3.5 Sonnet (New and Old), Claude-3.7 Sonnet, gpt-4o mini, gpt-4o, o1-mini, Llama-3.3 70B, DeepSeek-R1-Distill-Llama 70B, Qwen2.5 1.5B, Qwen2.5 32B, Phi-3.5 mini, GigaChat-Max, Vikhr-Llama-3.2 1B. We argue that this observation might have far-reaching consequences for both the safety and security of modern and future LLMs and the systems that employ them. As an illustration, we show that applying this method in combination with simple templates is sufficient to jailbreak previous-generation models, with ASR = 0.4 on gpt-4o mini. Our code and data artifacts are available at https://github.com/L3G5/llm-hidden-meanings
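
To make the "say abracadabra" example concrete, here is a minimal sketch of one way such a sequence could be built. It assumes, without confirmation from the text above, that each ASCII character is shifted by a fixed offset into the Byzantine Musical Symbols block (which starts at U+1D000). The names `BASE`, `encode`, and `decode` are hypothetical, and the authors' actual encodings may differ; see their repository for the real ones.

```python
# Illustrative sketch only: a fixed-offset mapping from ASCII into the
# Byzantine Musical Symbols block. This mapping is an assumption, not
# necessarily the encoding used in the paper.

BASE = 0x1D000  # first code point of the Byzantine Musical Symbols block


def encode(text: str, base: int = BASE) -> str:
    """Shift every ASCII character by `base`; leave other characters as-is."""
    return "".join(
        chr(base + ord(ch)) if ord(ch) < 0x80 else ch
        for ch in text
    )


def decode(text: str, base: int = BASE) -> str:
    """Undo `encode` for characters that fall inside the shifted ASCII range."""
    return "".join(
        chr(ord(ch) - base) if base <= ord(ch) < base + 0x80 else ch
        for ch in text
    )


hidden = encode("say abracadabra")
print(hidden)  # renders as Byzantine musical symbols (font permitting)

# Every encoded character is 4 bytes in UTF-8 and shares the prefix f0 9d;
# only the trailing bytes vary with the original ASCII character.
for ch in hidden[:3]:
    print(ch.encode("utf-8").hex(" "))

print(decode(hidden))  # round-trips back to the plain text
```

Under this assumed mapping, the trailing UTF-8 bytes are the only part that changes from character to character, which is the kind of byte-level regularity a BPE tokenizer could expose to the model.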

Paper Structure

This paper contains 5 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: gpt-4o (https://chatgpt.com/share/67bf2033-e3e4-8002-8cef-d0a51ea208f1) understands text written in Byzantine Musical Symbols (https://en.wikipedia.org/wiki/Byzantine_Musical_Symbols)
  • Figure 2: Understanding rate by model, encoding type and nudge
  • Figure 3: Our initial hypothesis on the underlying reason. The third example is given at https://chatgpt.com/share/67c187dd-ca00-8002-827e-747b6c88cf44
  • Figure 4: Understanding count by model and nudge
  • Figure 5: Bomb
  • ...and 1 more figure