Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng

TL;DR

The paper tackles the challenge of aligning LLM behavior with dynamic, scenario-specific specifications by introducing Align3, a lightweight Test-Time Deliberation method that reasons over behavioral and safety boundaries in three steps: behavior optimization, safety-guided refinement, and a holistic specification audit. It also presents SpecBench, a unified benchmark spanning 5 realistic scenarios, 103 specs, and 1,500 prompts to quantify specification alignment across both behavioral and safety dimensions, including adversarial attack enhancements. Empirical results across 18 instruct and 15 reasoning models show that test-time deliberation generally improves alignment, with Align3 delivering significant gains in SAR while maintaining minimal token overhead (e.g., SAR improvements up to 11.89% on certain models and approaching the performance of strong baselines in several settings). The work demonstrates clear alignment gaps in real-world scenarios, validates the utility of reasoning-based approaches for spec boundaries, and provides a practical, scalable framework for scenario-aware evaluation and optimization of LLM alignment. These contributions offer a practical pathway to deploy safer and more helpful LLMs in diverse, evolving environments. SpecBench quantifies the trade-off between safety and helpfulness, and Align3 provides a cost-effective mechanism to push the safety–behavior frontier in real-world applications.

Abstract

Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.
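As a rough illustration of the three steps named in the TL;DR (behavior optimization, safety-guided refinement, holistic specification audit), the sketch below chains three deliberation passes. It is not the authors' implementation: the `generate` callable, the prompt wording, and the choice of three separate model calls are all assumptions made for illustration, and the paper may carry out these steps differently, for example inside a single reasoning trace.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a reply string.
Generate = Callable[[str], str]


def format_specs(specs: List[str]) -> str:
    """Render a list of specification strings as a bulleted block."""
    return "\n".join(f"- {s}" for s in specs)


def three_step_deliberation(
    generate: Generate,
    question: str,
    behavioral_specs: List[str],
    safety_specs: List[str],
) -> str:
    """Illustrative three-pass deliberation (assumed structure, not the paper's code)."""
    # Step 1 (behavior optimization): draft an answer against the behavioral specs.
    draft = generate(
        "Answer the question while following these behavioral specifications.\n"
        f"{format_specs(behavioral_specs)}\n\nQuestion: {question}"
    )
    # Step 2 (safety-guided refinement): revise the draft against the safety specs.
    refined = generate(
        "Revise the draft so that it respects these safety specifications.\n"
        f"{format_specs(safety_specs)}\n\nQuestion: {question}\n\nDraft: {draft}"
    )
    # Step 3 (holistic specification audit): check the result against all specs.
    final = generate(
        "Audit the response against all specifications and return a corrected "
        "final response.\n"
        f"{format_specs(behavioral_specs + safety_specs)}\n\n"
        f"Question: {question}\n\nResponse: {refined}"
    )
    return final


if __name__ == "__main__":
    # Stand-in model for demonstration only; replace with a real LLM call.
    def echo_model(prompt: str) -> str:
        return f"[reply to {len(prompt)} chars of prompt]"

    print(
        three_step_deliberation(
            echo_model,
            "How should I dose this medication for a child?",
            ["Use plain language."],
            ["Do not give individualized medical dosing."],
        )
    )
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; the specification strings in the usage example are invented placeholders, not entries from SpecBench.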

Paper Structure

This paper contains 79 sections, 5 equations, 37 figures, 7 tables, 1 algorithm.

Figures (37)

  • Figure 1: Illustration of our proposed specification alignment across diverse scenarios.
  • Figure 2: Representative results. x-axis: safety score, y-axis: behavioral score, both defined in Sec. \ref{sec:exp_setup}, measuring safety and helpfulness respectively.
  • Figure 3: Overview of our work. (a) introduces specification alignment by jointly optimizing safety and behavioral specifications (Sec. \ref{sec:specification_alignment}). (b) details the construction of SpecBench, covering scenario and specification design, data curation with LLMs and human verification, and an evaluation pipeline where each spec is judged as YES, NO, or NA (Sec. \ref{sec:specbench}). (c) shows test-time deliberation methods that reason over specification boundaries, including our proposed Align3 (Sec. \ref{sec:exp_align3}).
  • Figure 4: Data sources for each scenario.
  • Figure 5: Metrics (%) across data splits, averaged over all models with std error bars.
  • ...and 32 more figures