Are We Wasting Time? A Fast, Accurate Performance Evaluation Framework for Knowledge Graph Link Predictors

Filip Cornell, Yifei Jin, Jussi Karlgren, Sarunas Girdzijauskas

TL;DR

This work tackles the heavy computational cost of full ranking in Knowledge Graph Completion evaluation by introducing a relation-recommender-guided sampling framework that draws informative negative candidates from relation domains and ranges. Using simple, scalable relation recommenders (e.g., Linear WD variants) and two sampling strategies (Static and Probabilistic), the approach sharply reduces evaluation time while closely matching the true ranking metrics; on the ogbl-wikikg2 benchmark, accurate estimates of the full, filtered ranking are obtained in 20 seconds instead of 30 minutes. The authors provide theoretical justification and extensive empirical results showing improved estimation accuracy (lower MAE, stable Kendall-Tau) alongside substantial speed-ups, and they give practical guidance on recommender choice and dataset-specific considerations. Overall, the framework enables fast, reliable benchmarking and quicker iteration in large-scale KG research, with potential extensions to training-time sampling and other ranking-style tasks.

Abstract

The standard evaluation protocol for measuring the quality of Knowledge Graph Completion methods - the task of inferring new links to be added to a graph - typically involves a step which ranks every entity of a Knowledge Graph to assess its fit as a head or tail of a candidate link to be added. For larger Knowledge Graphs, this step rapidly becomes prohibitively expensive. Previous approaches mitigate this problem by using random sampling of entities to assess the quality of links predicted or suggested by a method. However, we show that this approach has serious limitations since the ranking metrics produced do not properly reflect true outcomes. In this paper, we present a thorough analysis of these effects along with the following findings. First, we empirically find and theoretically motivate why sampling uniformly at random vastly overestimates the ranking performance of a method. We show that this can be attributed to the effect of easy versus hard negative candidates. Second, we propose a framework that uses relational recommenders to guide the selection of candidates for evaluation. We provide both theoretical and empirical justification of our methodology, and find that simple and fast methods can work extremely well, and that they match advanced neural approaches. Even when a large portion of true candidates for a property are missed, the estimation barely deteriorates. With our proposed framework, we reduce the time and computation needed to a level similar to that of random sampling strategies while vastly improving the estimation; on ogbl-wikikg2, we show that accurate estimations of the full, filtered ranking can be obtained in 20 seconds instead of 30 minutes. We conclude that effective preprocessing and sampling methods can save considerable computational effort while still reliably estimating the true performance of the entire ranking procedure.
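
The easy-versus-hard-negative effect described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the scorer `toy_scores`, the relation name `born_in`, and the choice to use every observed tail of a relation as the candidate set are all assumptions made for illustration, and filtering of other known true answers is omitted. The sketch only shows why a uniform sample can never yield a worse rank than the full ranking, whereas candidates drawn from the relation's range can recover it.

```python
import random


def rank_among(true_score, candidate_scores):
    """Rank of the true answer: 1 + number of candidates scoring strictly higher."""
    return 1 + sum(1 for s in candidate_scores if s > true_score)


def build_ranges(triples):
    """Map each relation to the set of entities observed as its tail (empirical range)."""
    ranges = {}
    for _head, relation, tail in triples:
        ranges.setdefault(relation, set()).add(tail)
    return ranges


def toy_scores(num_entities, range_entities, rng):
    """Hypothetical scorer: in-range entities score in [0.5, 1), all others in [0, 0.5)."""
    return {
        e: 0.5 + 0.5 * rng.random() if e in range_entities else 0.5 * rng.random()
        for e in range(num_entities)
    }


def demo(seed=0, num_entities=10_000, sample_size=100):
    rng = random.Random(seed)
    # Toy graph: one relation whose observed tails are entities 0..99.
    triples = [(rng.randrange(num_entities), "born_in", t) for t in range(100)]
    ranges = build_ranges(triples)
    scores = toy_scores(num_entities, ranges["born_in"], rng)

    true_tail = 42
    true_score = scores[true_tail]
    negatives = [e for e in range(num_entities) if e != true_tail]

    # Full evaluation: compare against every other entity.
    full_rank = rank_among(true_score, (scores[e] for e in negatives))

    # Uniform sampling: mostly easy negatives, so the rank can only improve.
    uniform = rng.sample(negatives, sample_size)
    uniform_rank = rank_among(true_score, (scores[e] for e in uniform))

    # Range-guided candidates: only entities seen as tails of the relation.
    guided = [e for e in ranges["born_in"] if e != true_tail]
    guided_rank = rank_among(true_score, (scores[e] for e in guided))

    return full_rank, uniform_rank, guided_rank


if __name__ == "__main__":
    full_rank, uniform_rank, guided_rank = demo()
    print(f"full={full_rank} uniform={uniform_rank} guided={guided_rank}")
```

In this toy setting every entity that outscores the true tail lies inside the relation's range, so the guided candidate set reproduces the full rank with roughly 1% of the comparisons. A real model and a real recommender will miss some hard negatives, which is the case the paper's analysis addresses.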

Paper Structure

This paper contains 38 sections, 1 theorem, 7 equations, 6 figures, 15 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{KG} \subset \mathcal{E} \times \mathcal{R} \times \mathcal{E}$. For any query $(h, r, ?)$, let $t$ be the true answer and $\mathcal{E}_{(h,r)}$ be all entities having a higher rank than $t$ in a full evaluation. Let $Y$ be the number of positions closer to the true rank when sampling o

Figures (6)

  • Figure 1: A: Normally, one ranks all, even highly irrelevant entities during evaluation. B: Instead, we propose using small, accurate subsets to alleviate the heavy burden of evaluation.
  • Figure 2: L-WD forms a global graph of confidence scores which are then aggregated into scores. These scores are later used to guide the selection of credible entities. In the co-occurrence graph, $D(\cdot)$-nodes represent domains of relations, $R(\cdot)$-nodes represent ranges.
  • Figure 3: Left: evaluation time in seconds (logarithmic scale) against sample size on ogbl-wikikg2; the dashed line is the full evaluation time. Middle: filtered MRR estimates of the different sampling strategies against sample size (% of $|\mathcal{E}|$); the dashed line is the true value. Right: estimated validation MRRs of the different methods over the course of training on ogbl-wikikg2.
  • Figure 4: MAPE (%) on FB15k, CoDEx-M and YAGO3-10.
  • Figure 5: MAPE (%) on FB15k-237 (left) and CoDEx-M (right).
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1