FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models

Radu Marinescu, Debarun Bhattacharjya, Junkyu Lee, Tigran Tchrakian, Javier Carnerero Cano, Yufang Hou, Elizabeth Daly, Alessandra Pascale

TL;DR

This paper addresses factuality in long-form LLM outputs by introducing FactReasoner, a neuro-symbolic factuality assessor that uses a graphical model to reason jointly over atomic units and retrieved contexts. It decomposes a response into atoms, retrieves evidence for them, encodes the entailment and contradiction relations as probabilistic factors, and computes the posterior marginal P(A_i) of each atom using efficient inference. The authors present three pipeline variants (FR1, FR2, FR3) that progressively incorporate more cross-context relationships, and introduce a novel E-measure alongside Pr(y) and F1@K to quantify factuality. Across labeled and unlabeled benchmarks, FactReasoner, especially FR2 and FR3, often outperforms prompting-based methods in factual precision and recall; it remains robust to conflicting information while leveraging the strength of LLMs at natural language inference (NLI). The work provides an open-source implementation and discusses practical considerations and limitations for deployment, including sensitivity to decomposition, retrieval quality, and computational overhead.

Abstract

Large language models (LLMs) have achieved remarkable success in generative tasks, yet they often fall short in ensuring the factual accuracy of their outputs, thus limiting their reliability in real-world applications where correctness is critical. In this paper, we present FactReasoner, a novel neuro-symbolic based factuality assessment framework that employs probabilistic reasoning to evaluate the truthfulness of long-form generated responses. FactReasoner decomposes a response into atomic units, retrieves relevant contextual information from external knowledge sources, and models the logical relationships (e.g., entailment, contradiction) between these units and their contexts using probabilistic encodings. It then estimates the posterior probability that each atomic unit is supported by the retrieved evidence. Our experiments on both labeled and unlabeled benchmark datasets demonstrate that FactReasoner often outperforms state-of-the-art prompt-based methods in terms of factual precision and recall. Our open-source implementation is publicly available at: https://github.com/IBM/FactReasoner.
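
To make the idea of a posterior over an atomic unit concrete, here is a minimal Python sketch of the situation in Figure 1: one atom A_1, a context C_1 that entails it, and a context C_2 that contradicts it. The factor definitions, the priors, and the function names below are illustrative assumptions, not the paper's actual encoding or the API of the open-source implementation. The posterior is computed by brute-force enumeration, which is only practical for toy models.

```python
from itertools import product


def prior(p):
    """Unary factor: weight p when the variable is true, 1 - p otherwise."""
    return lambda x: p if x else 1.0 - p


def entails(p):
    """Soft implication x -> y (assumed encoding): weight 1 - p only when
    x is true and y is false, and p in every other case."""
    return lambda x, y: (1.0 - p) if (x and not y) else p


def contradicts(p):
    """Soft mutual exclusion (assumed encoding): weight 1 - p only when
    x and y are both true, and p in every other case."""
    return lambda x, y: (1.0 - p) if (x and y) else p


def posterior_atom(prior_atom, prior_ctx, p_entail, p_contra):
    """Posterior P(A_1 = true) for one atom, a context C_1 that entails it,
    and a context C_2 that contradicts it, by summing over all 8 worlds."""
    f_a, f_c = prior(prior_atom), prior(prior_ctx)
    f_e, f_x = entails(p_entail), contradicts(p_contra)
    weight = {True: 0.0, False: 0.0}
    for a, c1, c2 in product([True, False], repeat=3):
        # Unnormalized weight of one world: the product of all factors.
        weight[a] += f_a(a) * f_c(c1) * f_c(c2) * f_e(c1, a) * f_x(c2, a)
    return weight[True] / (weight[True] + weight[False])


if __name__ == "__main__":
    # Hypothetical numbers: a confident entailment and a weak contradiction.
    print(posterior_atom(prior_atom=0.5, prior_ctx=0.9,
                         p_entail=0.95, p_contra=0.6))
```

Under these assumed factors, equally strong entailment and contradiction cancel out and leave the atom at its 0.5 prior, while a stronger entailment pushes the posterior above 0.5. A real system would replace the enumeration with a proper inference algorithm, since the number of worlds doubles with every added variable.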

Paper Structure

This paper contains 41 sections, 14 equations, 10 figures, and 40 tables.

Figures (10)

  • Figure 1: FactReasoner: the graphical model corresponding to one atom $A_1$ and two contexts $C_1$ and $C_2$ such that $C_1$ entails $A_1$ and $C_2$ contradicts $A_1$.
  • Figure 2: FactReasoner: the graphical model corresponding to one atom $A_1$ and three contexts $C_1$, $C_2$ and $C_3$ such that $C_3$ contradicts $C_2$.
  • Figure 3: The FactReasoner pipeline.
  • Figure 4: A graphical model with three bi-valued variables $X_1$, $X_2$ and $X_3$, and three binary functions.
  • Figure 5: An example user prompt and the corresponding long form response together with its supported (green) and not supported (red) atomic units.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Example 1
  • Example 2
  • Example 3
  • Example 4