Credible Intervals for Knowledge Graph Accuracy Estimation

Stefano Marchesin, Gianmaria Silvello

TL;DR

This work tackles the challenge of estimating Knowledge Graph accuracy with statistical guarantees when annotating every triple is infeasible. It shifts from frequentist confidence intervals to Bayesian credible intervals, showing that Highest Posterior Density (HPD) CrIs provide shorter, more reliable intervals, especially for skewed posteriors, and introduces the adaptive aHPD algorithm to automatically combine multiple uninformative priors without manual tuning. The results demonstrate substantial annotation-cost reductions across real and synthetic KG datasets, with up to 47% savings in high-precision scenarios, and robust performance across sampling strategies and dataset scales. Overall, the approach offers interpretable, one-shot probabilistic guarantees and practical efficiency for KG quality evaluation in real-world settings.

Abstract

Knowledge Graphs (KGs) are widely used in data-driven applications and downstream tasks, such as virtual assistants, recommendation systems, and semantic search. The accuracy of KGs directly impacts the reliability of the inferred knowledge and outcomes. Therefore, assessing the accuracy of a KG is essential for ensuring the quality of facts used in these tasks. However, the large size of real-world KGs makes manual triple-by-triple annotation impractical, thereby requiring sampling strategies to provide accuracy estimates with statistical guarantees. The current state-of-the-art approaches rely on Confidence Intervals (CIs), derived from frequentist statistics. While efficient, CIs have notable limitations and can lead to interpretation fallacies. In this paper, we propose to overcome the limitations of CIs by using Credible Intervals (CrIs), which are grounded in Bayesian statistics. These intervals are more suitable for reliable post-data inference, particularly in KG accuracy evaluation. We prove that CrIs offer greater reliability and stronger guarantees than frequentist approaches in this context. Additionally, we introduce aHPD, an adaptive algorithm that is more efficient for real-world KGs and statistically robust, addressing the interpretive challenges of CIs.

Paper Structure

This paper contains 41 sections, 5 theorems, 13 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Given a prior distribution $\operatorname{Beta}(a, b)$ for $\mu$ and a KG correctness annotation process where $0 < \tau_{\mathcal{S}} < n_{\mathcal{S}}$, the $1-\alpha$ HPD interval is the shortest interval $(l, u)$ satisfying $F(u) - F(l) = 1 - \alpha$, where $F$ is the posterior cumulative distribution function of $\mu$.
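
To make the statement concrete, here is a minimal Python sketch that approximates the HPD interval on a grid. It assumes that $\tau_{\mathcal{S}}$ is the number of sampled triples annotated as correct and $n_{\mathcal{S}}$ the sample size (the summary does not define them), so that by Beta-Binomial conjugacy the posterior is $\operatorname{Beta}(a + \tau_{\mathcal{S}}, b + n_{\mathcal{S}} - \tau_{\mathcal{S}})$. The function name `hpd_interval`, the `grid` parameter, and the grid-based approach itself are illustrative choices, not the authors' implementation.

```python
import math

def hpd_interval(tau, n, a=1.0, b=1.0, alpha=0.05, grid=20000):
    """Approximate the 1-alpha HPD interval of the Beta(a+tau, b+n-tau) posterior.

    Assumes 0 < tau < n, so the posterior is unimodal with an interior mode and
    the highest-density region is a single interval.
    """
    post_a, post_b = a + tau, b + n - tau
    dx = 1.0 / grid
    # Unnormalized log-density at each cell midpoint (midpoints avoid log(0)).
    logs = []
    for i in range(grid):
        x = (i + 0.5) * dx
        logs.append((post_a - 1) * math.log(x) + (post_b - 1) * math.log(1 - x))
    peak = max(logs)
    raw = [math.exp(v - peak) for v in logs]   # rescaled for numerical stability
    total = sum(raw)
    mass = [r / total for r in raw]            # posterior mass per cell

    # Add cells from highest to lowest density until 1-alpha mass is covered.
    covered, lo, hi = 0.0, grid, -1
    for i in sorted(range(grid), key=mass.__getitem__, reverse=True):
        covered += mass[i]
        lo, hi = min(lo, i), max(hi, i)
        if covered >= 1 - alpha:
            break
    return lo * dx, (hi + 1) * dx              # outer edges of the selected cells

if __name__ == "__main__":
    # Hypothetical sample: 27 correct triples out of 30, Uniform prior Beta(1, 1).
    l, u = hpd_interval(tau=27, n=30)
    print(f"95% HPD interval: ({l:.4f}, {u:.4f}), width {u - l:.4f}")
```

Selecting cells in order of decreasing density is equivalent to thresholding the posterior density, which is why the resulting interval is the shortest one holding $1-\alpha$ of the mass, up to the grid resolution.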

Figures (4)

  • Figure 1: Efficient KG accuracy evaluation framework (Marchesin and Silvello, 2024).
  • Figure 2: Comparison of equal-tailed (ET) and HPD CrIs across three posterior distributions for KG accuracy with increasing skewness (from left to right). In panel (a), where the posterior is symmetric, both ET and HPD intervals capture the most probable values (purple region). As skewness increases in panels (b) and (c), the ET interval drifts away from the highest posterior density region and includes less likely parameter values (red region), producing unnecessarily long intervals to satisfy the $1-\alpha = F(u) - F(l)$ condition. In contrast, the HPD intervals remain optimal, covering only the highest posterior density region (purple and blue regions), which highlights the advantage of HPD over ET for skewed distributions.
  • Figure 3: Expected width of HPD credible intervals under Kerman, Jeffreys, and Uniform priors for $n_{\mathcal{S}} = 30$ and $\alpha = 0.05$. The circle pattern ($\circ$) under the curves marks the set of accuracy values where the Kerman prior provides the shortest expected width, whereas the line pattern ($//$) marks the set of accuracy values where the Uniform prior performs best.
  • Figure 4: Annotation cost comparison between aHPD and Wilson at different confidence levels $1-\alpha$ under SRS (a) and TWCS (b) on YAGO, NELL, DBPEDIA, and FACTBENCH KGs. We also report the reduction ratio (in %) of aHPD over Wilson.
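
The comparisons described in the captions above can be explored numerically. The sketch below computes ET and HPD intervals for a skewed posterior under the three priors, alongside the Wilson interval used as the frequentist baseline. Several things here are assumptions rather than material from the paper: the prior parameters are the standard definitions (Kerman $\operatorname{Beta}(1/3, 1/3)$, Jeffreys $\operatorname{Beta}(1/2, 1/2)$, Uniform $\operatorname{Beta}(1, 1)$), the sample of 27 correct triples out of 30 is hypothetical, and the grid approximation and all function names are illustrative. It reports widths for a single observed sample, not the expected widths plotted in Figure 3.

```python
import math
from statistics import NormalDist

GRID = 20000  # number of cells on (0, 1); illustrative resolution

def cell_masses(a, b):
    """Normalized posterior mass of each grid cell for Beta(a, b), with a, b > 1."""
    dx = 1.0 / GRID
    logs = []
    for i in range(GRID):
        x = (i + 0.5) * dx
        logs.append((a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
    peak = max(logs)
    raw = [math.exp(v - peak) for v in logs]
    total = sum(raw)
    return [r / total for r in raw]

def et_interval(a, b, alpha=0.05):
    """Equal-tailed interval: cut alpha/2 posterior mass from each tail."""
    dx, acc, lower = 1.0 / GRID, 0.0, None
    for i, m in enumerate(cell_masses(a, b)):
        acc += m
        if lower is None and acc >= alpha / 2:
            lower = i * dx
        if acc >= 1 - alpha / 2:
            return lower, (i + 1) * dx
    return lower, 1.0

def hpd_interval(a, b, alpha=0.05):
    """HPD interval: keep the highest-density cells until 1-alpha mass is covered."""
    mass = cell_masses(a, b)
    dx, acc, lo, hi = 1.0 / GRID, 0.0, GRID, -1
    for i in sorted(range(GRID), key=mass.__getitem__, reverse=True):
        acc += mass[i]
        lo, hi = min(lo, i), max(hi, i)
        if acc >= 1 - alpha:
            break
    return lo * dx, (hi + 1) * dx

def wilson_interval(tau, n, alpha=0.05):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = tau / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z / denom * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half

# Standard prior definitions (assumed, not quoted from the paper).
PRIORS = {"Kerman": (1 / 3, 1 / 3), "Jeffreys": (0.5, 0.5), "Uniform": (1.0, 1.0)}

if __name__ == "__main__":
    tau, n = 27, 30  # hypothetical annotation outcome
    for name, (a, b) in PRIORS.items():
        et = et_interval(a + tau, b + n - tau)
        hpd = hpd_interval(a + tau, b + n - tau)
        print(f"{name:8s} ET width {et[1] - et[0]:.4f}  HPD width {hpd[1] - hpd[0]:.4f}")
    w = wilson_interval(tau, n)
    print(f"Wilson   width {w[1] - w[0]:.4f}")
```

For a symmetric posterior the two credible intervals coincide up to grid error. The gap opens only as the posterior becomes skewed, which is the high-accuracy regime the captions emphasize.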

Theorems & Definitions (7)

  • Example 1
  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Corollary 2
  • Theorem 3
  • Example 2