Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models

Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban

TL;DR

Audit-of-Understanding (AoU) addresses reasoning-induced hallucinations by constraining inference to validated premises. It formalizes posterior-constrained inference: decompose a query into candidate premises $\mathcal{G}$, audit their support to form $\mathcal{G}^+$, and condition the final prediction on $\mathcal{G}^+$. The framework provides theoretical guarantees under perfect validation and excess-risk bounds under imperfect validation, plus tractability analysis. Empirically, AoU yields substantial accuracy and faithfulness gains on GSM8K, MultiArith, and SVAMP across multiple models, without external tools, outperforming several prompting baselines. This approach yields robust, interpretable reasoning traces and opens avenues for extending principled auditing to broader reasoning tasks and tool integrations.

Abstract

Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is posterior-constrained inference, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20–28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at https://anonymous.4open.science/r/audit-of-understanding-E28B.
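The three phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in AoU each phase is carried out by a language model, whereas here `toy_decompose`, `toy_audit`, and `toy_solve` are hypothetical rule-based stand-ins, and the word problem, the premise names, and the "number must appear in the question" audit rule are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# A premise is (hypothetical premise name, numeric value it asserts).
Premise = Tuple[str, int]


def aou_answer(
    question: str,
    decompose: Callable[[str], List[Premise]],
    audit: Callable[[str, Premise], bool],
    solve: Callable[[Dict[str, int]], float],
) -> Tuple[float, List[Premise]]:
    """Run the three phases: decompose, audit, then solve on validated premises only."""
    candidates = decompose(question)                            # phase 1: candidate premises
    validated = [p for p in candidates if audit(question, p)]   # phase 2: keep supported ones
    return solve(dict(validated)), validated                    # phase 3: constrained inference


def toy_decompose(question: str) -> List[Premise]:
    # Stand-in for a model call; the last premise is deliberately unsupported.
    return [
        ("distance_miles", 120),
        ("time_hours", 3),
        ("trip_miles", 180),
        ("stop_hours", 1),
    ]


def toy_audit(question: str, premise: Premise) -> bool:
    # Toy support check (an assumption): the premise's number must occur in the question.
    stated_numbers = set(re.findall(r"\d+", question))
    return str(premise[1]) in stated_numbers


def toy_solve(facts: Dict[str, int]) -> float:
    # Uses whatever premises it is given, including an unsupported stop if present.
    speed = facts["distance_miles"] / facts["time_hours"]
    return facts["trip_miles"] / speed + facts.get("stop_hours", 0)


QUESTION = (
    "A car travels 120 miles in 3 hours. "
    "How long does a 180-mile trip take at the same speed?"
)

# Answer conditioned on the audited premises only.
audited_answer, kept = aou_answer(QUESTION, toy_decompose, toy_audit, toy_solve)

# Answer when every candidate premise is trusted without auditing.
unaudited_answer = toy_solve(dict(toy_decompose(QUESTION)))
```

In this toy setting the audit drops the `stop_hours` premise because the question never states it, so the solver cannot build on it; without the audit, the same solver folds the unsupported premise into its answer.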

Paper Structure

This paper contains 35 sections, 6 theorems, 23 equations, 3 figures.

Key Result

Theorem 1

Under perfect validation, the reasoning trace of $\hat{a}_{\mathrm{AoU}}$ satisfies

Figures (3)

  • Figure 1: Illustrative example of Audit-of-Understanding. The top shows reasoning with unchecked assumptions leading to an incorrect answer, while the bottom shows AoU filtering invalid assumptions, yielding the correct answer.
  • Figure 2: Illustration of the Audit-of-Understanding (AoU) pipeline on a real math word problem. The pipeline consists of three phases: (1) Assume, (2) Audit, and (3) Solve. This prevents hallucinated reasoning steps and ensures faithfulness. The model uses G1–G4 to compute the speed and deduce that the 180-mile trip will take 4.5 hours.
  • Figure 3: Performance comparison of different models (Mistral-7B-v0.3, DeepSeek-7B, and Phi-3.5) across three datasets (GSM8K, MultiArith, SVAMP) under various decoding strategies: CoT-Greedy (blue), Self-Consistency (red), CoT-Decoding (yellow), and our proposed AoU (green).

Theorems & Definitions (16)

  • Remark: Hypothetical completions
  • Definition 1: Reasoning trace and minimal dependence set
  • Theorem 1: Trace faithfulness without unsupported premises
  • proof
  • Definition 2: Hallucination
  • Corollary 1: No hallucination under perfect validation
  • proof
  • Theorem 2: Complexity Bound
  • proof
  • Definition 3: Indicator Function
  • ...and 6 more