Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach, Dung Nguyen, Thao Minh Le, Truyen Tran

TL;DR

The paper analyzes why safety alignment in large language models remains incomplete, identifying gradient concentration and signal decay in autoregressive training as the root cause: safety training changes early response tokens strongly and later tokens only weakly. It introduces base-favored tokens as fine-grained indicators of undertrained regions and validates a two-pronged remedy: inference-time contrastive decoding to demonstrate mechanistic control, and a training-time targeted completion framework that applies adaptive penalties and a hybrid teacher to complete the learned safety distribution across all positions. Across the Llama and Qwen model families, the approach reduces attack success rates by roughly 48–98% while preserving utility, and yields deep safety alignment that strengthens proactive deliberative reasoning under attack. The work offers a principled, mechanistic path to completing safety learning without broad retraining, with implications for production deployment and scalability.

Abstract

Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning in which safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens (vocabulary elements to which base models assign higher probability than aligned models) as computational indicators of incomplete safety learning, and we develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48–98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding of, and practical solutions for, fundamental limitations in safety alignment methodologies.
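
The base-favored token definition in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `margin` parameter, the four-word vocabulary, and the logit values are all hypothetical and chosen only to show the comparison at a single token position.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def base_favored_tokens(base_logits, aligned_logits, vocab, margin=0.0):
    """Return tokens whose base-model probability exceeds the
    aligned-model probability by more than `margin` at one position.

    `margin` is an assumption added for illustration; margin=0.0
    matches the plain definition (base probability > aligned probability).
    """
    p_base = softmax(base_logits)
    p_aligned = softmax(aligned_logits)
    return [
        tok
        for tok, pb, pa in zip(vocab, p_base, p_aligned)
        if pb - pa > margin
    ]

# Toy next-token logits (illustrative values, not data from the paper).
vocab = ["Sure", "I", "cannot", ","]
base_logits = [2.0, 0.5, -1.0, 0.0]
aligned_logits = [-1.0, 1.5, 2.0, 0.0]

favored = base_favored_tokens(base_logits, aligned_logits, vocab)
print(favored)
```

In practice the two logit vectors would come from running the base and aligned models on the same prefix. Repeating the comparison at every response position gives per-position counts of the kind reported in Figure 1.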

Paper Structure

This paper contains 34 sections, 18 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Position-wise analysis of shallow alignment detection. Left: KL divergence between aligned and base models shows declining distributional differences across token positions. Right: Base-favored tokens exhibit the same shallow alignment pattern, with adversarial safety contexts showing systematically higher counts than benign contexts (45 vs 30 tokens at early positions, 35 vs 25 at later positions). Base-favored tokens validate shallow alignment detection while providing vocabulary-level identification of specific undertrained tokens that aggregate measures cannot localize.
  • Figure 2: Base-Favored Tokens Reveal Distributional Differences. Most frequent base-favored tokens for Llama-3-8B using harmful instruction-response pairs. Tokens are predominantly formatting elements (punctuation, special tokens), common words, and structural elements rather than explicitly harmful content, supporting the distributional alignment hypothesis.
  • Figure 3: KL divergence across token positions on Llama 3.1 8B between the aligned and base models using normal decoding (blue) and the contrastive decoding intervention (orange) in safety-critical contexts (HEx-PHI dataset). Contrastive decoding maintains higher KL divergence throughout the sequence, indicating sustained safety alignment in later positions.
  • Figure 4: Deep Alignment Recovery Under Adversarial Attack. Prefill attack success rates demonstrate that our method (44.5% ASR) significantly outperforms existing safety preservation methods (69.3-73.4% ASR) and approaches the robustness of uncompromised base models (47.4% ASR), validating comprehensive deep alignment restoration.
  • Figure 5: Base-Favored Token Frequency Analysis Across Model Families. Most frequent base-favored tokens before (blue, top) and after (orange, bottom) targeted completion intervention across four models. Increased frequencies in orange bars demonstrate the successful application of our penalty mechanism, which systematically targets and suppresses base-favored tokens in safety-critical contexts. The cross-model consistency validates the generalizability of our approach.
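
The position-wise KL divergence plotted in Figures 1 and 3 can be sketched as follows. This is an illustration under stated assumptions: the captions do not give the direction of the divergence, so KL(aligned || base) is assumed here, and the function names and three-token distributions are hypothetical toy values rather than results from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    `eps` guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def positionwise_kl(aligned_probs, base_probs):
    """One KL value per token position.

    Each argument is a list of next-token distributions,
    one distribution per response position.
    """
    return [kl_divergence(p, q) for p, q in zip(aligned_probs, base_probs)]

# Toy distributions over a 3-token vocabulary at 3 positions
# (illustrative only): the models disagree strongly at position 0
# and almost agree by position 2.
aligned = [[0.90, 0.05, 0.05], [0.50, 0.30, 0.20], [0.34, 0.33, 0.33]]
base = [[0.10, 0.60, 0.30], [0.30, 0.40, 0.30], [0.33, 0.33, 0.34]]

kl_by_position = positionwise_kl(aligned, base)
for pos, kl in enumerate(kl_by_position):
    print(f"position {pos}: KL = {kl:.4f}")
```

A declining curve like this toy one corresponds to the shallow-alignment pattern described for Figure 1, while a curve that stays high at later positions corresponds to the contrastive decoding result described for Figure 3.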

Theorems & Definitions (4)

  • Definition 1: Adversarial Safety Contexts
  • Definition 2: Base-Favored Tokens
  • Definition 3: Targeted $L_2$ Completion Loss
  • Definition 4: Hybrid Teacher Model