From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma

TL;DR

This paper proposes Two-Stage Causal-GRPO (TSC-GRPO), a framework grounded in causal identifiability theory and designed to achieve intent pinning. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.

Abstract

Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
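
The second stage can be illustrated with a minimal sketch, under stated assumptions, of how a cumulative penalty might interact with group-relative advantages. The function names, the hinge at a threshold `tau`, the coefficient `alpha`, and the toy probe scores are all hypothetical choices made for illustration; the abstract does not give the exact reward formula, so this is not the authors' implementation.

```python
from statistics import mean, pstdev


def cumulative_causal_penalty(probe_scores, tau):
    """Sum the per-token probe scores that exceed the threshold tau.

    Every term is non-negative, so the running total never decreases
    as more tokens are generated (assumed form, not from the paper).
    """
    return sum(max(0.0, s - tau) for s in probe_scores)


def shaped_reward(base_reward, probe_scores, alpha, tau):
    """Base reward minus alpha times the cumulative penalty."""
    return base_reward - alpha * cumulative_causal_penalty(probe_scores, tau)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy "fork-in-the-road" group: three continuations of the same harmful
# prompt, each with made-up per-token harmful-intent probe scores.
group = {
    "refuse_early": [0.9, 0.1, 0.1, 0.1],
    "refuse_late": [0.9, 0.8, 0.2, 0.1],
    "comply": [0.9, 0.9, 0.9, 0.9],
}
rewards = {
    name: shaped_reward(0.0, scores, alpha=1.0, tau=0.5)
    for name, scores in group.items()
}
advantages = dict(zip(rewards, group_relative_advantages(list(rewards.values()))))
```

In this toy group the continuation that keeps producing harmful tokens receives the lowest reward and a negative advantage, while a late refusal still scores better than full compliance, which is the behavior the abstract describes as enabling late-stage refusals.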

Paper Structure (42 sections, 3 theorems, 20 equations, 3 figures, 4 tables)

Key Result

Theorem 1

Assume the hidden-state generation process $h = f(c, s)$ satisfies the independence and connectivity conditions. Then a probe $g(\cdot)$ trained to minimize the following objective provably recovers the latent intent $c$ up to an invertible transformation: […] where $\lambda$ is a hyperparameter, $h$ and $h^+$ are representations of the same intent $c$ under different styles $s$ and $s'$, and $\mathcal{L}_{un}$ en…
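
The objective itself is not reproduced in this excerpt. As a rough illustration only, the sketch below shows one common way to combine an alignment term over same-intent pairs with a uniformity term weighted by $\lambda$, following the alignment-and-uniformity formulation of Wang and Isola (2020). Treating $\mathcal{L}_{un}$ as that uniformity loss, and every function name, the temperature `t`, and the toy vectors, are assumptions rather than the paper's definitions.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment_loss(z, z_pos):
    """Mean squared distance between paired probe outputs g(h) and g(h+)."""
    return sum(sq_dist(a, b) for a, b in zip(z, z_pos)) / len(z)


def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs in z."""
    pairs = [(i, j) for i in range(len(z)) for j in range(i + 1, len(z))]
    total = sum(math.exp(-t * sq_dist(z[i], z[j])) for i, j in pairs)
    return math.log(total / len(pairs))


def probe_objective(z, z_pos, lam):
    """Assumed combined objective: alignment plus lam times uniformity."""
    return alignment_loss(z, z_pos) + lam * uniformity_loss(z)


# Toy probe outputs: two intents, each seen under two styles.
z = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
z_pos = [l2_normalize(v) for v in [[0.9, 0.1], [0.1, 0.9]]]
loss = probe_objective(z, z_pos, lam=0.5)
```

The alignment term pulls together representations that share an intent but differ in style, while the uniformity term discourages the probe from collapsing all inputs to a single point; a collapsed batch gives a uniformity value of zero, and well-spread outputs give a negative one.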

Figures (3)

  • Figure 1: Visualizing Semantic Collapse in Shallowly Aligned Models. (a) PCA projection shows that while harmful (red squares) and safe (blue circles) requests are distinct at $t=0$, they collapse into a single, indistinguishable singularity after prefix injection ($t>k$); (b) The accuracy of a linear probe trained at $t=0$ drops rapidly during the prefix injection phase (Phase 2) and remains slightly above the random chance level ($0.5$) during subsequent generation (Phase 3).
  • Figure 2: Hyper-parameter sensitivity analysis evaluating the ASR against GCG attacks on AdvBench across three base models. The sub-figures illustrate the impact of varying (a) the uniformity loss weight $\lambda$ in Stage 1, (b) the causal reward coefficient $\alpha$ in Stage 2, and (c) the similarity threshold $\tau$.
  • Figure 3: Ablation study on data construction strategies. The bottom table details the specific combination of data views used for probe training, where $\bullet$ indicates inclusion.

Theorems & Definitions (5)

  • Theorem 1: Identifiability of Latent Intent
  • Lemma 1: Style Invariance
  • proof
  • Lemma 2: Content Injectivity
  • proof