Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng
TL;DR
The paper tackles the challenge of aligning LLM behavior with dynamic, scenario-specific specifications. It introduces Align3, a lightweight Test-Time Deliberation (TTD) method that reasons over behavioral and safety boundaries in three steps: behavior optimization, safety-guided refinement, and a holistic specification audit. It also presents SpecBench, a unified benchmark spanning 5 realistic scenarios, 103 specifications, and 1,500 prompts, which quantifies specification alignment along both behavioral and safety dimensions and includes adversarial attack enhancements. Empirical results across 18 instruct and 15 reasoning models show that test-time deliberation generally improves alignment, and that Align3 delivers SAR improvements of up to 11.89% on certain models with minimal token overhead, approaching the performance of strong baselines in several settings. The work exposes clear alignment gaps in real-world scenarios, supports reasoning-based approaches to specification boundaries, and provides a practical, scalable framework for scenario-aware evaluation and optimization of LLM alignment. SpecBench quantifies the trade-off between safety and helpfulness, and Align3 offers a cost-effective mechanism for pushing the safety-behavior frontier in real-world applications.
Abstract
Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.
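To make the three-step deliberation named in the TL;DR concrete (behavior optimization, safety-guided refinement, holistic specification audit), the following is a minimal sketch rather than the authors' implementation. It assumes each step is a separate model call; the function name `align3_sketch`, the `generate` callable, and all prompt wording are hypothetical, and the paper's actual prompts and control flow may differ.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
Generate = Callable[[str], str]


def _bullets(items: List[str]) -> str:
    """Format a list of spec as a bulleted block for inclusion in a prompt."""
    return "\n".join(f"- {item}" for item in items)


def align3_sketch(
    question: str,
    behavioral_spec: List[str],
    safety_spec: List[str],
    generate: Generate,
) -> str:
    """Illustrative three-step deliberation; prompts here are assumptions."""
    behavior = _bullets(behavioral_spec)
    safety = _bullets(safety_spec)

    # Step 1: behavior optimization. Draft a response against behavioral-spec only.
    draft = generate(
        "Question:\n" + question
        + "\n\nBehavioral specifications:\n" + behavior
        + "\n\nWrite a response that follows these specifications."
    )

    # Step 2: safety-guided refinement. Revise the draft against safety-spec.
    refined = generate(
        "Question:\n" + question
        + "\n\nDraft response:\n" + draft
        + "\n\nSafety specifications:\n" + safety
        + "\n\nRevise the draft so it violates none of these specifications."
    )

    # Step 3: holistic specification audit. Check the result against all spec.
    final = generate(
        "Question:\n" + question
        + "\n\nCandidate response:\n" + refined
        + "\n\nAll specifications:\n" + behavior + "\n" + safety
        + "\n\nAudit the candidate against every specification and return a corrected final response."
    )
    return final
```

Any text-in, text-out model wrapper can be passed as `generate`. Ordering the steps from behavior to safety to a combined audit mirrors the sequence listed in the TL;DR; running them as three calls, rather than inside a single reasoning trace, is a simplification made for this sketch.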
