$PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought

Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya

Abstract

Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle to adhere to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. These lengthy prompts also harm overall performance because of the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in context. We also introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
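
To make the reward idea in the abstract concrete, the following is a minimal sketch of how a Jaccard-based recall score might be combined with a hallucination penalty. The abstract does not give the exact formula, so everything here is an assumption for illustration: the function names, the representation of policies as string identifiers, the `penalty_weight` parameter, and the choice to subtract a penalty proportional to the fraction of recalled policies that do not exist in the policy set.

```python
from typing import Iterable


def jaccard(recalled: set, relevant: set) -> float:
    """Jaccard score |A ∩ B| / |A ∪ B|; defined here as 1.0 when both sets are empty."""
    union = recalled | relevant
    if not union:
        return 1.0
    return len(recalled & relevant) / len(union)


def policy_recall_reward(
    recalled: Iterable[str],
    relevant: Iterable[str],
    known_policies: Iterable[str],
    penalty_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Illustrative reward: Jaccard overlap minus a hallucination penalty.

    A recalled policy counts as hallucinated when it is absent from
    `known_policies` (an assumed definition).
    """
    recalled_set, relevant_set = set(recalled), set(relevant)
    known_set = set(known_policies)

    hallucinated = recalled_set - known_set
    # Penalty scales with the share of recalled policies that are hallucinated.
    penalty = (
        penalty_weight * len(hallucinated) / len(recalled_set)
        if recalled_set
        else 0.0
    )
    return jaccard(recalled_set, relevant_set) - penalty


if __name__ == "__main__":
    known = {"P1", "P2", "P3", "P4", "P5"}
    reward = policy_recall_reward({"P1", "P2", "P9"}, {"P1", "P2", "P3"}, known)
    print(f"reward = {reward:.4f}")
```

In this sketch, recalling exactly the relevant policies yields the maximum reward, missing or irrelevant policies lower the Jaccard term, and invented policies are penalized a second time through the penalty term.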

Paper Structure (48 sections, 13 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: While traditional policy-adherence agents need in-context business policies, which can range from 10k to 90k tokens, our method recalls only the relevant policies. For each request, the relevant policies span roughly 150-400 tokens, up to 225x fewer tokens than the traditional method.
  • Figure 2: Overview of our multi-stage CoT refinement loop, consisting of Generation, Rubric Evaluation, CoT Evaluation, and Targeted Refinement.
  • Figure 3: Overview of the proposed PolicyRecall reward, consisting of a policy-recall–based reward and a hallucination-based penalty.
  • Figure 4: Evolution of CoT quality metrics through iterative generation-refinement, showing consistent improvements across all dimensions (4.5%–14.6% gains).
  • Figure 5: Example CoTs across different rounds of filtering and refinement. Correct and relevant policies are shown in green, hallucinated policies in red, and missing policies in orange.
  • ...and 1 more figure