Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion

Cunhang Fan; Yujie Chen; Jun Xue; Yonghui Kong; Jianhua Tao; Zhao Lv

TL;DR

The paper tackles the high computational cost of description-based KGC by introducing Progressive Distillation Based on Masked Generation Features (PMD). PMD combines masked generation feature distillation (MGFD) with a two-stage progressive distillation to transfer rich, inferred representations from a strong teacher to multi-grade student models, achieving state-of-the-art results on WN18RR in pre-distillation and up to 56.7% parameter reduction in the progressive stage. Key losses include $\mathcal{L}_{CE}$, $\mathcal{L}_{SCORE}$, and $\mathcal{L}_{MGFD}$, balanced by hyperparameters $\alpha$ and $\beta$, with the total loss $\mathcal{L}=(1-\alpha-\beta)\mathcal{L}_{CE}+\alpha\mathcal{L}_{SCORE}+\beta\mathcal{L}_{MGFD}$. Extensive experiments on WN18RR and FB15K-237, plus ablations, demonstrate that MGFD enhances representation transfer and that progressive distillation maintains performance while dramatically reducing parameters, enabling efficient description-based KGC in resource-constrained settings.
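The weighted total loss above can be illustrated with a minimal sketch. The function name `pmd_total_loss` and the check that $\alpha$ and $\beta$ are non-negative and sum to at most 1 are assumptions made for illustration; the summary only gives the formula itself. The three loss terms are taken as already-computed scalars (or tensors that support arithmetic).

```python
def pmd_total_loss(l_ce, l_score, l_mgfd, alpha, beta):
    """Combine the three supervision terms as
    L = (1 - alpha - beta) * L_CE + alpha * L_SCORE + beta * L_MGFD.

    The range check is an assumption: it keeps all three weights
    non-negative so the result is a convex combination.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("expected alpha >= 0, beta >= 0 and alpha + beta <= 1")
    return (1 - alpha - beta) * l_ce + alpha * l_score + beta * l_mgfd


# Illustrative values only, not taken from the paper.
example = pmd_total_loss(l_ce=2.0, l_score=1.0, l_mgfd=0.5, alpha=0.25, beta=0.25)
print(example)  # 0.5 * 2.0 + 0.25 * 1.0 + 0.25 * 0.5 = 1.375
```

Setting $\beta = 0$ drops the MGFD term and leaves a combination of $\mathcal{L}_{CE}$ and $\mathcal{L}_{SCORE}$ only, which is one way to picture an ablation of the MGFD module.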

Abstract

In recent years, knowledge graph completion (KGC) models based on pre-trained language models (PLMs) have shown promising results. However, the large number of parameters and high computational cost of PLMs pose challenges for their application in downstream tasks. This paper proposes a progressive distillation method based on masked generation features for the KGC task, aiming to significantly reduce the complexity of pre-trained models. Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models. However, traditional feature distillation is limited by the single representation of information in teacher models. To solve this problem, we propose masked generation of teacher-student features, which contain richer representation information. Furthermore, there is a significant gap in representation ability between teacher and student. Therefore, we design a progressive distillation method to distill student models at each grade level, enabling efficient knowledge transfer from teachers to students. The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods. Furthermore, in the progressive distillation stage, the model significantly reduces the number of parameters while maintaining a certain level of performance. Specifically, the parameters of the lower-grade student model are reduced by 56.7% compared to the baseline.

Paper Structure (17 sections, 6 equations, 3 figures, 4 tables)

Introduction
Related Work
Knowledge Graph Completion
Knowledge Distillation
Methodology
Definitions and Notation
Masked Generation Feature Distillation
Progressive Distillation Framework
Experiments
Experimental Setup
Main Result
Ablation
Q1: Is PMD More Efficient Than Common Distillation Strategies?
Q2: Are Both Progressive Distillation Module and MGFD Module Useful?
Q3: What Effects Do Different Mask Rates in the MGFD Module Have?
...and 2 more sections

Figures (3)

Figure 1: Overall architecture of PMD. ($\mathrm{i}$) MGFD applies masking operations to input tokens and sets an appropriate mask rate based on the student model's parameter count. ($\mathrm{ii}$) In the pre-distillation stage, the performance of the initial model is improved. ($\mathrm{iii}$) The progressive distillation stage involves the design of multi-grade student models with gradually reduced parameter counts and mask rates. ($\mathrm{iv}$) Each student model is trained under three kinds of supervision, as depicted.
Figure 2: Comparison experiments between the diminishing mask rate and the fixed mask rate.
Figure 3: Hits@1 and Hits@10 of $\mathrm{PMD_{12}}$ at mask rates from 0% to 50%.
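The captions above describe masking input tokens at a rate that shrinks as the student models get smaller. A rough sketch of that idea follows. Everything specific here is an assumption for illustration: the helper names `mask_tokens` and `diminishing_mask_rates`, the `[MASK]` placeholder, uniform random position selection, and the linear schedule are not taken from the paper, which may choose rates and positions differently.

```python
import random


def mask_tokens(tokens, mask_rate, mask_token="[MASK]", seed=None):
    """Replace a fraction `mask_rate` of the tokens with `mask_token`.

    Positions are sampled uniformly at random without replacement
    (an assumption; the paper's selection rule is not given here).
    """
    if not 0.0 <= mask_rate <= 1.0:
        raise ValueError("mask_rate must be in [0, 1]")
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_rate)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


def diminishing_mask_rates(start_rate, end_rate, num_grades):
    """Linearly decrease the mask rate from `start_rate` to `end_rate`
    over `num_grades` student grades (hypothetical schedule)."""
    if num_grades < 1:
        raise ValueError("num_grades must be at least 1")
    if num_grades == 1:
        return [start_rate]
    step = (start_rate - end_rate) / (num_grades - 1)
    return [start_rate - i * step for i in range(num_grades)]


# Illustrative usage: three student grades, larger students get a higher mask rate.
tokens = ["the", "head", "entity", "description", "goes", "here", "as", "plain", "text", "."]
for grade, rate in enumerate(diminishing_mask_rates(0.5, 0.1, 3), start=1):
    print(grade, round(rate, 2), mask_tokens(tokens, rate, seed=grade))
```

A fixed-rate baseline, as compared in Figure 2, corresponds to passing the same `mask_rate` to every grade instead of the decreasing list.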
