Optimizing Decomposition for Optimal Claim Verification

Yining Lu, Noah Ziems, Hy Dang, Meng Jiang

TL;DR

This work addresses the misalignment between decomposition and verification in long-form factuality evaluation by introducing a verifier-aware, dynamic decomposition framework. It formulates the search for a decomposition policy as a bilevel optimization problem and approximates a solution with an on-policy RL approach (DyDecomp), which learns a decomposition policy that tunes subclaim atomicity to each verifier’s preferred information density. Empirically, DyDecomp improves verification confidence by about 0.07 and accuracy by about 0.12 (on a 0-1 scale) on average across verifiers and datasets, while training only 4.73M parameters. The study also shows that verification confidence correlates strongly with accuracy and that the optimal atomicity varies across verifiers, which underscores the value of adapting decomposition to the downstream verifier for robust long-form factuality evaluation.

Abstract

Current research on the Decompose-Then-Verify paradigm for evaluating the factuality of long-form text typically treats decomposition and verification in isolation, overlooking their interactions and potential misalignment. We find that existing decomposition policies, typically hand-crafted demonstrations, do not align well with downstream verifiers in terms of atomicity (a novel metric quantifying information density), leading to suboptimal verification results. We formulate finding the optimal decomposition policy for optimal verification as a bilevel optimization problem. To approximate a solution for this strongly NP-hard problem, we propose dynamic decomposition, a reinforcement learning framework that leverages verifier feedback to learn a policy for dynamically decomposing claims to verifier-preferred atomicity. Experimental results show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across varying verifiers, datasets, and atomicities of input claims.

Paper Structure

This paper contains 49 sections, 9 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Left: overall framework of the Decompose-Then-Verify paradigm. We define each decomposer and verifier as an LLM paired with a corresponding policy. Our dynamic decomposition is compatible with existing fact-checking systems and requires training only a decomposition policy with 4.73M parameters. Right: the plot (upper right) shows that the verification confidence of the verifier (i.e., Inst-Llama-7B with a retrieval verification policy) peaks at atomicity 1. An atomicity of -1 denotes that the claim is partially trivial and tautological. The example (lower right) shows that the decomposition policy from FActScore (Min et al., 2023) fails to generate subclaims that best evoke the verifier's performance, leading to suboptimal results. We provide an additional example in the appendix to show the limitation of existing decomposition policies.
  • Figure 2: Verification confidence versus accuracy. The number in each convex hull denotes the claim atomicity. Irrespective of data source, atomicity, and verifier, verification confidence exhibits a strong positive correlation with accuracy (Pearson's r = 0.88).
  • Figure 3: Breadth-first order sampling for dynamic decomposition. We perform binary decomposition for each claim. The number in each node represents its sampling priority in the decomposition process. Subclaims at the same atomicity level are sampled first, and newly generated subclaims are queued in FIFO (first-in-first-out) order.
  • Figure 4: Verification confidence across atomicities. Each verifier has its own preferred input atomicity at which verification confidence peaks. Even under the same verification policy, such as retrieval, different verifiers exhibit distinct preferences, and the same verifier exhibits distinct preferences under different verification policies.
  • Figure 5: The verification sensitivity of dynamic decomposition as the training data size changes. The five panels (from left to right) correspond to claims with atomicity in the range $[0,4]$, each evaluated under DyDecomp policies trained on different dataset sizes. The horizontal dashed line denotes the verification confidence for the original claims without decomposition. The decomposition LLM is Llama3-Inst-70B and the verification LLM is Llama3-Inst-8B with a retrieval verification policy.
  • ...and 1 more figure
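
The breadth-first sampling order described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the claim representation (a tuple of atomic facts), the `split_in_half` decomposer, and the `should_split` callback that stands in for the learned decomposition policy are all hypothetical and chosen only to make the queueing order visible.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Assumption: a toy claim is a tuple of atomic facts. In the paper a claim
# is free text and the split is produced by a decomposition LLM.
Claim = Tuple[str, ...]


def split_in_half(claim: Claim) -> Tuple[Claim, Claim]:
    """Hypothetical binary decomposer: cut the claim into two halves."""
    mid = len(claim) // 2
    return claim[:mid], claim[mid:]


def bfs_decompose(
    claim: Claim,
    should_split: Callable[[Claim], bool],
    split: Callable[[Claim], Tuple[Claim, Claim]] = split_in_half,
) -> Tuple[List[Claim], List[Claim]]:
    """Visit claims breadth-first and return (visit order, final subclaims).

    `should_split` is a stand-in for the decomposition policy's decision.
    """
    queue: Deque[Claim] = deque([claim])
    visit_order: List[Claim] = []
    leaves: List[Claim] = []
    while queue:
        current = queue.popleft()  # FIFO: oldest queued claim comes first
        visit_order.append(current)
        if len(current) > 1 and should_split(current):
            left, right = split(current)
            # New subclaims join the back of the queue, so every claim at
            # the current level is handled before any of its descendants.
            queue.append(left)
            queue.append(right)
        else:
            leaves.append(current)
    return visit_order, leaves


order, leaves = bfs_decompose(("a", "b", "c", "d"), should_split=lambda c: True)
for priority, visited in enumerate(order, start=1):
    print(priority, visited)
```

Because new subclaims are appended to the back of the queue, both halves of the root are visited before any of their children, which matches the level-by-level priority numbering the caption describes. Replacing the always-true `should_split` with a callback that stops earlier leaves coarser subclaims as the final output.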