Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Jianfeng Si; Lin Sun; Weihong Lin; Xiangzheng Zhang

TL;DR

The paper tackles the safety-helpfulness tradeoff in large language models by introducing PACT, a hierarchical multi-policy framework that combines a non-overridable Global Policy with runtime-configurable User Policies. It leverages a risk-aware Chain-of-Thought Path (CoTPath) to perform per-label decision routing (COMPLY, GUIDE, REJECT) with an early global safety check. The authors develop a three-stage methodology—self-distillation for risk classification, CoTPath construction with Label2Action mappings, and runtime hierarchical inference—and validate it through extensive safety, helpfulness, and controllability experiments across multiple benchmarks and scenarios. The results show near state-of-the-art safety under the global policy while achieving superior controllability under user policies, highlighting improved safety-utility balance and deployment practicality.

Abstract

Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
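
The hierarchical Classify$\rightarrow$Act routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in PACT the classification and routing are carried out by the model's chain-of-thought, whereas here risk labels are assumed to be given as input. The names `GLOBAL_POLICY`, `UserPolicy`, and `route`, the example labels, and the rule that the most restrictive action wins when several labels apply are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"


# Hypothetical global policy: these labels are always rejected and
# cannot be remapped by any user policy.
GLOBAL_POLICY = frozenset({"child_safety", "violent_extremism"})

# Assumption: when several labels apply, the most restrictive action wins.
SEVERITY = {Action.COMPLY: 0, Action.GUIDE: 1, Action.REJECT: 2}


@dataclass
class UserPolicy:
    """Runtime-configurable label-to-action mapping (illustrative)."""
    label2action: dict[str, Action] = field(default_factory=dict)
    default: Action = Action.COMPLY


def route(labels: list[str], policy: UserPolicy) -> tuple[Action, list[str]]:
    """Map predicted risk labels to an action and return a readable trace."""
    trace: list[str] = []

    # Step 1: early global safety check. A hit exits immediately,
    # before the user policy is consulted.
    for label in labels:
        if label in GLOBAL_POLICY:
            trace.append(f"global:{label} -> reject")
            return Action.REJECT, trace

    # Step 2: per-label routing under the user policy.
    actions = []
    for label in labels:
        action = policy.label2action.get(label, policy.default)
        trace.append(f"user:{label} -> {action.value}")
        actions.append(action)

    # No labels means no detected risk, so the policy default applies.
    decision = max(actions, key=SEVERITY.get, default=policy.default)
    return decision, trace
```

In this sketch a user policy that tries to map a global label to COMPLY has no effect, because the global check runs first and exits early; that ordering is what makes the global policy non-overridable here.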

Paper Structure (45 sections, 2 equations, 2 figures, 5 tables)

Introduction
Related Work
Safety Alignment and Risk Detection
Instruction Hierarchies and Prompt Guardrails
Configurable Safety Behaviors and Control
Positioning of PACT
Methodology
Multi-Directional Self-Distillation
CoTPath Construction and Unified SFT
Runtime Hierarchical Inference
Experiments
Experimental Setup
Model Implementation
Baseline Models
Evaluation Datasets
...and 30 more sections

Figures (2)

Figure 1: PACT End-to-End Architecture. The CoTPath explicitly encodes the Classify$\rightarrow$Act pathway, with early-exit at Global Policy detection for critical risks.
Figure 2: Execution Flow of CoTPath with Label2Action Mapping in PACT Runtime Safety Control.
