Model-based Subsampling for Knowledge Graph Completion

Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

TL;DR

This work addresses bias in negative sampling for knowledge graph embedding (KGE) caused by count-based subsampling (CBS) in sparse knowledge graphs. It introduces Model-based Subsampling (MBS), which uses predictions from a sub-model to estimate appearance probabilities for triplets, and Mixed Subsampling (MIX), which blends CBS and MBS to leverage their complementary strengths. Across FB15k-237, WN18RR, and YAGO3-10 and multiple KGE backbones (RotatE, TransE, HAKE, ComplEx, DistMult), MBS and MIX improve standard metrics such as MRR and Hits@K, with statistical significance (p < 0.01) in many settings. The results highlight the importance of sub-model choice and hyper-parameter tuning, and point to practical gains in KG completion tasks on sparse datasets, albeit with extra pre-training cost for sub-model selection.

Abstract

Subsampling is effective in Knowledge Graph Embedding (KGE) for reducing overfitting caused by the sparsity of Knowledge Graph (KG) datasets. However, current subsampling approaches consider only the frequencies of queries that consist of entities and their relations. As a result, existing subsampling can underestimate the appearance probabilities of infrequent queries even when the frequencies of their entities or relations are high. To address this problem, we propose Model-based Subsampling (MBS) and Mixed Subsampling (MIX), which estimate the appearance probabilities of queries through the predictions of KGE models. Evaluation results on FB15k-237, WN18RR, and YAGO3-10 show that the proposed subsampling methods improve KG completion performance for popular KGE models: RotatE, TransE, HAKE, ComplEx, and DistMult.
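
To make the contrast between count-based and model-based subsampling concrete, the following is a minimal sketch and not the authors' exact formulation. It assumes an inverse-square-root weighting over query counts as an illustrative choice. The function names, the `probs` input (standing in for appearance probabilities predicted by a pre-trained sub-model), and the mixing weight `lam` are all hypothetical.

```python
from collections import Counter
import math


def count_based_weights(triples):
    """CBS sketch: weight each triple by its observed query counts.

    Assumption: the inverse square root of the summed (head, relation)
    and (relation, tail) counts is used purely for illustration.
    """
    hr = Counter((h, r) for h, r, t in triples)
    rt = Counter((r, t) for h, r, t in triples)
    raw = [1.0 / math.sqrt(hr[(h, r)] + rt[(r, t)]) for h, r, t in triples]
    total = sum(raw)
    return [w / total for w in raw]


def model_based_weights(probs):
    """MBS sketch: replace raw counts with model-estimated probabilities.

    `probs` is a hypothetical list of positive appearance probabilities,
    one per triple, predicted by a sub-model.
    """
    raw = [1.0 / math.sqrt(p) for p in probs]
    total = sum(raw)
    return [w / total for w in raw]


def mixed_weights(cbs, mbs, lam=0.5):
    """MIX sketch: convex combination of the two weightings.

    `lam` is a hypothetical hyper-parameter in [0, 1].
    """
    return [lam * m + (1.0 - lam) * c for c, m in zip(cbs, mbs)]


triples = [("a", "r", "b"), ("a", "r", "c"), ("d", "r", "b"), ("e", "s", "f")]
cbs = count_based_weights(triples)
mbs = model_based_weights([0.4, 0.3, 0.2, 0.1])  # made-up sub-model outputs
mix = mixed_weights(cbs, mbs, lam=0.5)
```

In this toy example the triple whose queries appear only once receives the largest count-based weight. The model-based weights depend only on the predicted probabilities, so a query that is rare in the counts can still be treated as likely when the sub-model says so.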

Paper Structure (22 sections, 10 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The averaged KGC performance (MRR) of KGE models with and without subsampling on FB15k-237, WN18RR, and YAGO3-10.
  • Figure 2: Frequencies of entities and relations included in each query that appeared only once in training data of FB15k-237, WN18RR, and YAGO3-10.
  • Figure 3: Appearance probabilities (%) of queries in CBS and MBS that have the lowest 100 CBS frequencies for each setting, sorted left to right in descending order by their CBS frequencies.