How Sharp and Bias-Robust is a Model? Dual Evaluation Perspectives on Knowledge Graph Completion

Sooho Moon, Yunyong Ko

TL;DR

This paper identifies two underexplored evaluation aspects in knowledge graph completion: predictive sharpness of individual predictions and robustness to popularity bias. It introduces PROBE, a framework with a tunable rank transformer and a popularity-aware rank aggregator to produce perspective-aware KGC scores. Through experiments on FB15k-237 and WN18RR with multiple models, PROBE reveals that traditional metrics can misestimate model performance and that model rankings vary with the evaluation perspective; it also provides practical guidance on choosing α and β to match applications. The authors release code and datasets, enabling researchers to evaluate KGC models under diverse, real-world requirements.

Abstract

Knowledge graph completion (KGC) aims to predict missing facts from the observed KG. While a number of KGC models have been studied, the evaluation of KGC still remains underexplored. In this paper, we observe that existing metrics overlook two key perspectives for KGC evaluation: (A1) predictive sharpness -- the degree of strictness in evaluating an individual prediction, and (A2) popularity-bias robustness -- the ability to predict low-popularity entities. Toward reflecting both perspectives, we propose a novel evaluation framework (PROBE), which consists of a rank transformer (RT) estimating the score of each prediction based on a required level of predictive sharpness and a rank aggregator (RA) aggregating all the scores in a popularity-aware manner. Experiments on real-world KGs reveal that existing metrics tend to over- or under-estimate the accuracy of KGC models, whereas PROBE yields a comprehensive understanding of KGC models and reliable evaluation results.
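
To make the two-stage idea in the abstract concrete, the following is a minimal Python sketch of a rank transformer followed by a popularity-aware aggregator. It is not the PROBE implementation: the power-law transform `rank ** -alpha`, the weight `popularity ** -beta`, the function names, and the reading of α as the sharpness knob and β as the popularity-robustness knob are all assumptions made for illustration, since this summary does not give the actual formulas.

```python
def transform_rank(rank, alpha):
    """Hypothetical rank transformer (an assumption, not PROBE's RT).

    Maps a rank (1 = best) to a score in (0, 1]. A larger alpha
    penalises lower-placed predictions more strictly.
    """
    return 1.0 / (rank ** alpha)


def popularity_weights(entities, popularity, beta):
    """Hypothetical popularity-aware weights (an assumption).

    Each prediction is weighted by popularity ** -beta, normalised to
    sum to 1. beta = 0 gives uniform weights; a larger beta shifts
    weight toward low-popularity entities.
    """
    raw = [popularity[e] ** (-beta) for e in entities]
    total = sum(raw)
    return [w / total for w in raw]


def perspective_score(ranks, entities, popularity, alpha, beta):
    """Weighted average of the transformed rank scores."""
    scores = [transform_rank(r, alpha) for r in ranks]
    weights = popularity_weights(entities, popularity, beta)
    return sum(w * s for w, s in zip(weights, scores))


# Toy data: the popular entity "a" is ranked well, the rare "c" poorly.
ranks = [1, 2, 4]
entities = ["a", "b", "c"]
popularity = {"a": 100, "b": 10, "c": 1}

uniform = perspective_score(ranks, entities, popularity, alpha=1.0, beta=0.0)
robust = perspective_score(ranks, entities, popularity, alpha=1.0, beta=1.0)
```

Under these assumptions, `alpha=1.0, beta=0.0` reduces to the ordinary mean reciprocal rank, while raising `beta` lowers the score of this toy model because its only poor prediction is on the rare entity. This is one way a single set of ranks can look better or worse depending on the evaluation perspective.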

Paper Structure

This paper contains 8 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Observations: Existing rank-based metrics have overlooked two subtle yet critical perspectives (P1) and (P2).
  • Figure 2: Rank transformers of PROBE: Affine RT ensures that all scores are in the range [0,1].
  • Figure 3: (a) Similar distributions in training and test sets and (b) weight functions with varying popularity-bias robustness.
  • Figure 4: The 3-D visualized accuracy of KGC models: Different KGC models exhibit varying performance patterns depending on different evaluation perspectives.