Scalable Feature Learning on Huge Knowledge Graphs for Downstream Machine Learning

Félix Lefebvre, Gaël Varoquaux

TL;DR

SEPAL tackles the scalability gap in knowledge-graph embeddings by learning high-quality, globally consistent embeddings from a small, dense core and propagating them to the rest of a huge graph through relation-aware message passing. It introduces BLOCS to split ultra-large graphs into balanced, overlapping subgraphs, enabling single-GPU training while preserving connectivity and relational coverage. Theoretical analysis links SEPAL’s propagation to global embedding alignment and an Arnoldi-like evolution, and empirical results on 7 large KG datasets and 46 downstream tasks show strong downstream performance and substantial speedups over baseline large-scale KGEs. The approach reduces engineering overhead, scales to graphs with hundreds of millions of triples on commodity hardware, and remains adaptable to various base KGE models, with potential for continual learning.

Abstract

Many machine learning tasks can benefit from external knowledge. Large knowledge graphs store such knowledge, and embedding methods can be used to distill it into ready-to-use vector representations for downstream applications. For this purpose, however, current models have two limitations: they are primarily optimized for link prediction via local contrastive learning, and applying them to the largest graphs requires significant engineering effort due to GPU memory limits. To address these, we introduce SEPAL: a Scalable Embedding Propagation ALgorithm for large knowledge graphs, designed to produce high-quality embeddings for downstream tasks at scale. The key idea of SEPAL is to ensure global embedding consistency by optimizing embeddings only on a small core of entities, and then propagating them to the rest of the graph with message passing. We evaluate SEPAL on 7 large-scale knowledge graphs and 46 downstream machine learning tasks. Our results show that SEPAL significantly outperforms previous methods on downstream tasks. In addition, SEPAL scales up its base embedding model, making it possible to fit huge knowledge graphs on commodity hardware.
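As a rough illustration of the "small core" idea, the sketch below selects a core subgraph from a list of triples. The selection rule (keep the `core_size` highest-degree entities and the triples between them) is an assumption made for illustration; the text above only says the core is small and dense. The function name `extract_core` and the toy triples are hypothetical.

```python
from collections import Counter


def extract_core(triples, core_size):
    """Return (core_entities, core_triples) for the highest-degree entities.

    Assumption: "core" means the `core_size` entities with the largest degree,
    plus every triple whose head and tail both belong to that set.
    """
    degree = Counter()
    for head, _, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    core_entities = {entity for entity, _ in degree.most_common(core_size)}
    core_triples = [
        (h, r, t) for h, r, t in triples
        if h in core_entities and t in core_entities
    ]
    return core_entities, core_triples


# Toy graph (made-up identifiers): "a" and "c" have degree 3, "b" and "d" degree 2.
toy_triples = [
    ("a", "r1", "b"),
    ("a", "r1", "c"),
    ("b", "r2", "c"),
    ("c", "r1", "d"),
    ("a", "r2", "d"),
]
core_entities, core_triples = extract_core(toy_triples, core_size=2)
print(sorted(core_entities), core_triples)
```

Only the core triples would then be passed to a base embedding model; the remaining entities receive their embeddings later, through propagation.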

Paper Structure

This paper contains 98 sections, 1 theorem, 25 equations, 23 figures, 15 tables, 1 algorithm.

Key Result

Proposition 4.1

Let $\mathcal{E}$ be the "alignment energy" defined as with $\phi(\boldsymbol\theta_h, \boldsymbol{w}_r) = \boldsymbol\theta_h\odot\boldsymbol{w}_r$ being the DistMult relational operator. Then, SEPAL's propagation step amounts to a mini-batch projected gradient step descending $\mathcal{E}$ under the following conditions: As a consequence, SEPAL converges towards a stationary point of $\mathcal{E}$.
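To make the DistMult relational operator and the propagation step more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The update rule is an assumption: each non-core tail entity adds `alpha` times the sum of its incoming messages $\boldsymbol\theta_h\odot\boldsymbol{w}_r$ to its current embedding and is then projected onto the unit sphere. The names `propagate_step`, `frozen`, and `alpha`, and the toy entities, are hypothetical.

```python
import math


def distmult_message(theta_h, w_r):
    """DistMult relational operator: element-wise product of head and relation."""
    return [a * b for a, b in zip(theta_h, w_r)]


def normalize(vec):
    """Project onto the unit sphere; the zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def propagate_step(entity_emb, relation_emb, triples, frozen, alpha=1.0):
    """One synchronous propagation step over (head, relation, tail) triples.

    Entities in `frozen` (the core) keep their embeddings. Heads that have no
    embedding yet send no message. Assumed update for every other tail:
    normalize(current + alpha * sum of incoming messages).
    """
    dim = len(next(iter(relation_emb.values())))
    incoming = {}
    for head, rel, tail in triples:
        if tail in frozen or head not in entity_emb:
            continue
        message = distmult_message(entity_emb[head], relation_emb[rel])
        acc = incoming.setdefault(tail, [0.0] * dim)
        for i, m in enumerate(message):
            acc[i] += m
    updated = dict(entity_emb)
    for tail, acc in incoming.items():
        current = entity_emb.get(tail, [0.0] * dim)
        updated[tail] = normalize([c + alpha * a for c, a in zip(current, acc)])
    return updated


# Toy example: one core entity, two outer entities reached in two hops.
entity_emb = {"france": [0.6, 0.8]}
relation_emb = {"has_capital": [1.0, -1.0], "has_landmark": [-1.0, 1.0]}
triples = [
    ("france", "has_capital", "paris"),
    ("paris", "has_landmark", "eiffel_tower"),
]
step1 = propagate_step(entity_emb, relation_emb, triples, frozen={"france"})
step2 = propagate_step(step1, relation_emb, triples, frozen={"france"})
print(sorted(step1))  # ['france', 'paris']
print(sorted(step2))  # ['eiffel_tower', 'france', 'paris']
```

Each step extends the embedded region by one hop while the core embeddings stay fixed, which is the sense in which outer embeddings are derived from the core rather than trained independently.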

Figures (23)

  • Figure 1: SEPAL's embedding pipeline. First, a core subgraph is extracted from the input knowledge graph (step 1.a). BLOCS then subdivides this input knowledge graph into outer subgraphs (step 1.b). Next, the core subgraph is embedded using traditional KGE models, which generate vector representations for both core entities and relations (step 2.). Finally, these embeddings are propagated with message passing to each outer subgraph successively (step 3.).
  • Figure 2: Statistical performance on real-world tables. a) Pareto frontiers of averaged normalized prediction scores with respect to embedding times (log-scale). b) Critical difference diagrams (Terpilowski, 2019) of average ranks among the three methods (SEPAL, PBG and DGL-KE) that scale to every knowledge-graph dataset. The ranks are averaged over all tasks, a task being defined as the combination of a downstream table and a source knowledge graph. SEPAL gets the best average downstream performance for each of the 7 source knowledge graphs. Figure \ref{fig:rw-hbar} gives the detailed results for each table. Appendix \ref{app:dt_setup} details the metric used.
  • Figure 3: Statistical performance on WikiDBs tables. a) Critical difference diagrams of scalable methods. Black lines connect methods that are not significantly different. b) Pareto frontiers of averaged normalized prediction scores with respect to embedding times (log-scale). Figure \ref{fig:wkdb-violin} in Appendix \ref{app:wkdb-results} provides the detailed results for each of the 38 test tables.
  • Figure 4: Entity coverage of downstream tables. Of the 46 downstream tables, 4 are used for validation (in blue) and 42 are used for testing (in maroon).
  • Figure 5: Detailed results on real-world tables. The "Cumulative normalized mean cross-validation score" reported is obtained by summing the normalized mean cross-validation scores. For an evaluation dataset, 1 corresponds to the best $R^2$ score across all models; as there are 4 evaluation datasets, the highest possible score for a model is 4 (getting a score of 4 means that the model beats every model on every evaluation dataset). SEPAL, PyTorch-BigGraph, DGL-KE, and NodePiece use DistMult as base model. Embedding computation times are provided on the right-hand side of the figure. Figure \ref{fig:full-downstream} extends this figure with other embedding models.
  • ...and 18 more figures
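The cumulative normalized score described in the Figure 5 caption can be sketched as follows. The normalization rule is an assumption: each model's mean cross-validation score on a dataset is divided by the best score on that dataset, so the best model gets 1. The caption states only that 1 corresponds to the best score, and it refers to the appendix for the exact metric. The function name, model names, and numbers below are made up for illustration and are not results from the paper.

```python
def cumulative_normalized_scores(mean_cv_scores):
    """Sum per-dataset normalized scores for each model.

    mean_cv_scores maps dataset -> {model: mean cross-validation score}.
    Assumption: normalization divides by the best score on each dataset,
    which must be strictly positive.
    """
    totals = {}
    for dataset, per_model in mean_cv_scores.items():
        best = max(per_model.values())
        if best <= 0:
            raise ValueError(f"best score on {dataset!r} must be positive")
        for model, score in per_model.items():
            totals[model] = totals.get(model, 0.0) + score / best
    return totals


# Illustrative numbers only (two hypothetical models, two hypothetical tables).
toy_scores = {
    "table_1": {"model_a": 0.8, "model_b": 0.4},
    "table_2": {"model_a": 0.45, "model_b": 0.6},
}
totals = cumulative_normalized_scores(toy_scores)
print(totals)
```

With this rule, a model reaches the maximum (the number of evaluation datasets) only if it has the best score on every one of them, which matches the caption's description of a score of 4.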

Theorems & Definitions (2)

  • Proposition 4.1: Implicit Gradient Descent
  • Proof: Proof sketch