Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

Safe Transformer is a modular approach that augments pre-trained language models by inserting a discrete information bottleneck, containing an explicit safety bit, between transformer layers. It requires only lightweight fine-tuning, not pre-training from scratch.

Abstract

Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its safety judgments fail. We propose Safe Transformer, a modular approach that augments pre-trained language models by inserting a discrete information bottleneck containing an explicit safety bit between transformer layers. The safety bit serves as both an interpretable signal of the model's safety classification and a controllable switch: through contrastive training, the model learns disentangled representations where the safety bit governs the behavioral mode - producing helpful responses when $s=1$ and refusals when $s=0$ - while additional unsupervised bits $u$ in the information bottleneck carry semantic content, preserving the model's generation capabilities. This design achieves both interpretability (the safety decision is directly readable) and controllability (the safety bit can be manually overridden), requiring only lightweight fine-tuning without pre-training from scratch. In red-team benchmarks, Safe Transformer achieves near-zero Attack Success Rate, substantially outperforming base models and safety fine-tuning baselines.

Paper Structure (63 sections, 16 equations, 11 figures, 6 tables, 1 algorithm)


Figures (11)

  • Figure 1: Safe transformer architecture. Orange modules are newly introduced; blue modules are from the base model. Left: The information bottleneck processes the key-value input to the decoder, ensuring generation is conditioned on the discrete code $(s, u)$. The decoder serves as an adapter that bridges the representation gap between our bottleneck output and the upper layers' expected input distribution. Right: Unlike standard VAEs with only unsupervised latents, we introduce a supervised safety bit $s$ alongside the unsupervised latent $u$. The Write-in FFN outputs logits for both components: the safety logit is discretized into $s=\mathbf{1}(z_0>0)$, while $u$ is sampled from the remaining logits. The safety logit is trained with supervised labels to classify input prompt safety, while $u$ preserves sufficient information flow through the bottleneck for generation quality.
  • Figure 2: How the latent code $(s, \mathbf{u})$ is determined during training vs. inference. Left (Training): The safety bit $s$ is fixed to the ground truth label $s^*$, while $\mathbf{u}$ is sampled from a uniform prior $p(\mathbf{u})$. Right (Inference): The safety bit $s$ is computed by the encoder $f_\phi(x)$, while $\mathbf{u}$ is sampled from the prior. In both cases, tokens $y_t$ are generated autoregressively conditioned on $x$, $s$, $\mathbf{u}$, and $y_{<t}$.
  • Figure 3: Comparison of over-refusal. The safety classifier triggers on surface patterns ("kill") without understanding context, refusing benign programming questions while correctly refusing genuinely harmful requests.
  • Figure 4: Stage 1 training data examples. Safe prompts (left, green) are labeled $y=1$; unsafe prompts (right, red) are labeled $y=0$.
  • Figure 5: Safety logit distributions on the test set. Left: eos token strategy; Right: average strategy. Safe prompts (green) cluster at positive logits while unsafe prompts (red) cluster at negative logits, with clear separation at the decision boundary (dashed line at $z=0$).
  • ...and 6 more figures
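
The captions for Figures 1 and 2 describe how the discrete code $(s, u)$ is chosen. The rule can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `override` argument, and the choice of 8 unsupervised bits are hypothetical, and the sketch assumes that the uniform prior over $u$ means independent fair bits, which the captions do not state. Only the threshold rule $s=\mathbf{1}(z_0>0)$, the use of the ground truth label during training, and the manual override described in the abstract come from the source.

```python
import random
from typing import List, Optional, Tuple


def safety_bit_from_logit(z0: float) -> int:
    """Discretize the safety logit as s = 1(z0 > 0), per the Figure 1 caption."""
    return 1 if z0 > 0 else 0


def sample_prior_bits(num_bits: int, rng: random.Random) -> List[int]:
    """Draw u from a uniform prior.

    Assumption: u is a vector of independent fair bits; the captions only
    say that the prior p(u) is uniform.
    """
    return [rng.randint(0, 1) for _ in range(num_bits)]


def latent_code(
    num_bits: int,
    rng: random.Random,
    *,
    safety_logit: Optional[float] = None,
    label: Optional[int] = None,
    override: Optional[int] = None,
) -> Tuple[int, List[int]]:
    """Choose the discrete code (s, u) following the Figure 2 caption.

    Training:  pass `label` (the ground truth s*), which fixes s.
    Inference: pass `safety_logit` (the encoder's z0), which is thresholded.
    `override` (hypothetical name) forces s by hand, mirroring the
    controllability claim in the abstract.
    """
    if label is not None:
        s = label
    elif safety_logit is not None:
        s = safety_bit_from_logit(safety_logit)
    else:
        raise ValueError("need a ground-truth label or a safety logit")
    if override is not None:
        s = override
    return s, sample_prior_bits(num_bits, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print("training :", latent_code(8, rng, label=0))
    print("inference:", latent_code(8, rng, safety_logit=2.3))
    print("override :", latent_code(8, rng, safety_logit=2.3, override=0))
```

In this sketch the safety bit is the only part of the code that depends on the prompt's safety judgment, so reading `s` exposes the decision and setting `override` flips the behavioral mode without touching `u`.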