RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Rohan Gupta; Erik Jenner

TL;DR

We introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their black-box behaviour. Token-level monitors are highly vulnerable to this attack, whereas more holistic monitors, such as max-pooling or attention-based probes, remain robust.

Abstract

Latent-space monitors aim to detect undesirable behaviours in large language models (LLMs) by leveraging their internal representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions. However, these monitors may themselves become training signals, for example when problematic samples found in deployment are used to retrain models. This raises an important question: can models learn to evade such monitors? To evaluate this capability, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their black-box behaviour. We apply RL-Obfuscation to language models ranging from 7B to 14B parameters and evaluate their Evasion Success Rate against a suite of monitors. We find that token-level monitors are highly vulnerable to this attack, while more holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, for these vulnerable monitors, models trained to evade a single static monitor can generalise to evade other unseen monitors. We also find that models can be trained to conditionally bypass latent-space monitors on only certain inputs. Finally, we study how the models bypass these monitors and find that they can learn to repurpose tokens to have different internal representations.
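
To make the contrast between token-level and more holistic monitors concrete, the following is a minimal sketch and not the paper's implementation. It assumes a probe has already produced one score per generated token, that a token-level monitor averages those scores, that a max-pooling monitor takes their maximum, and that Evasion Success Rate is the fraction of harmful generations scoring below the detection threshold. The function names, the scores, and the threshold are all hypothetical.

```python
from statistics import mean


def mean_aggregate(token_scores):
    """Assumed token-level monitor: average the per-token probe scores."""
    return mean(token_scores)


def max_aggregate(token_scores):
    """Assumed max-pooling monitor: keep the highest per-token probe score."""
    return max(token_scores)


def evasion_success_rate(generation_scores, threshold):
    """Fraction of harmful generations that the monitor fails to flag.

    A generation counts as evading the monitor when its aggregated
    score falls below the threshold (an assumption for this sketch).
    """
    if not generation_scores:
        raise ValueError("need at least one generation score")
    missed = sum(1 for score in generation_scores if score < threshold)
    return missed / len(generation_scores)


# Hypothetical per-token probe scores for three harmful generations.
# Two of them contain a single high-scoring token among low-scoring ones.
generations = [
    [0.1, 0.2, 0.9, 0.1],
    [0.3, 0.2, 0.4, 0.3],
    [0.1, 0.1, 0.95, 0.05],
]
threshold = 0.5  # hypothetical detection threshold

mean_esr = evasion_success_rate([mean_aggregate(g) for g in generations], threshold)
max_esr = evasion_success_rate([max_aggregate(g) for g in generations], threshold)

print(f"mean-aggregated monitor ESR: {mean_esr:.2f}")
print(f"max-pooling monitor ESR:     {max_esr:.2f}")
```

In this toy setting, diluting a few high-scoring tokens among many low-scoring ones is enough to slip past the averaged score but not past the maximum, which is one intuition for why pooled monitors might be harder to evade. Whether the trained models actually behave this way is a question for the paper's experiments, not something this sketch shows.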
