Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought
Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang
TL;DR
The paper tackles the safety-helpfulness tradeoff in large language models by introducing PACT, a hierarchical multi-policy framework that combines a non-overridable Global Policy with runtime-configurable User Policies. It leverages a risk-aware Chain-of-Thought Path (CoTPath) to perform per-label decision routing (COMPLY, GUIDE, REJECT) with an early global safety check. The authors develop a three-stage methodology—self-distillation for risk classification, CoTPath construction with Label2Action mappings, and runtime hierarchical inference—and validate it through extensive safety, helpfulness, and controllability experiments across multiple benchmarks and scenarios. The results show near state-of-the-art safety under the global policy while achieving superior controllability under user policies, highlighting improved safety-utility balance and deployment practicality.
Abstract
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
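The hierarchical routing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: in PACT the classification and action selection happen inside the model's chain-of-thought, whereas here the risk label is assumed to be already predicted and passed in as a string. The names `Action`, `GLOBAL_POLICY`, `route`, and `medical_policy`, the label strings, and the fallback to COMPLY for unlisted labels are all hypothetical choices made for illustration.

```python
from enum import Enum


class Action(Enum):
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"


# Hypothetical global policy: critical-risk labels that always map to REJECT.
# The label names are illustrative, taken from the examples in the abstract.
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}


def route(label, user_policy, default=Action.COMPLY):
    """Classify -> Act: map an already-predicted risk label to an action.

    The global policy is checked first, so a user policy can never
    override it. The default for unlisted labels is an assumption.
    """
    # Early global safety check: global labels win unconditionally.
    if label in GLOBAL_POLICY:
        return GLOBAL_POLICY[label]
    # Otherwise apply the runtime-configurable label-to-action mapping.
    return user_policy.get(label, default)


# Hypothetical user policy for a medical deployment.
medical_policy = {
    "medical_advice": Action.GUIDE,
    "child_safety": Action.COMPLY,  # attempted override; ignored by route()
}

print(route("medical_advice", medical_policy))
print(route("child_safety", medical_policy))
```

Checking the global mapping before consulting the user mapping is what makes the global policy non-overridable in this sketch: the conflicting `child_safety` entry in `medical_policy` is never read.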
