HeteroMILE: a Multi-Level Graph Representation Learning Framework for Heterogeneous Graphs

Yue Zhang; Yuntian He; Saket Gurukar; Srinivasan Parthasarathy

TL;DR

HeteroMILE tackles the scalability gap in heterogeneous graph embedding with a generic multi-level framework that coarsens a large heterogeneous graph, embeds the coarsened graph, and refines the embeddings back to the original graph using a heterogeneous graph convolutional network (HGCN). It provides two coarsening strategies (Jaccard similarity and LSH) and a refinement stage that uses an HGCN with weight sharing across levels, which keeps it compatible with existing base methods such as Metapath2Vec and GATNE. The approach reduces running time substantially (up to ~20x) while maintaining or improving embedding quality on link prediction and node classification across four real-world datasets, including the large OGB_MAG graph. These results indicate that HeteroMILE is a practical, scalable way to learn high-quality embeddings on large heterogeneous graphs without requiring specialized hardware upgrades.
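To make the Jaccard-based coarsening idea concrete, here is a minimal Python sketch of one matching pass. It is an illustration under stated assumptions, not the authors' implementation: the function names, the greedy pairing order, the `threshold` parameter, and the toy author/paper graph are all hypothetical. It only shows the general idea of merging same-type nodes whose neighbor sets overlap strongly.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_same_type_nodes(neighbors, node_type, threshold=0.5):
    """Greedily pair same-type nodes whose neighbor sets overlap.

    neighbors: dict node -> set of neighbor nodes
    node_type: dict node -> type label
    Returns a dict mapping every node to a supernode id.
    (Hypothetical helper; the threshold is an assumption of this sketch.)
    """
    # Score every same-type pair; quadratic, so suitable for a toy example only.
    scored = []
    for u, v in combinations(sorted(neighbors), 2):
        if node_type[u] == node_type[v]:
            s = jaccard(neighbors[u], neighbors[v])
            if s >= threshold:
                scored.append((s, u, v))
    # Highest similarity first; ties broken by node name for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    supernode = {}
    next_id = 0
    for _, u, v in scored:
        if u not in supernode and v not in supernode:
            supernode[u] = supernode[v] = next_id
            next_id += 1
    # Unmatched nodes become singleton supernodes.
    for n in sorted(neighbors):
        if n not in supernode:
            supernode[n] = next_id
            next_id += 1
    return supernode

# Toy heterogeneous graph: authors a1..a3 and papers p1..p3.
neighbors = {
    "a1": {"p1", "p2"},
    "a2": {"p1", "p2"},
    "a3": {"p3"},
    "p1": {"a1", "a2"},
    "p2": {"a1", "a2"},
    "p3": {"a3"},
}
node_type = {n: ("author" if n.startswith("a") else "paper") for n in neighbors}
mapping = match_same_type_nodes(neighbors, node_type)
```

In this toy graph, `a1` and `a2` share all their papers and are merged, as are `p1` and `p2`, so six nodes shrink to four supernodes. Restricting matches to nodes of the same type is what keeps the coarsened graph heterogeneous.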

Abstract

Heterogeneous graphs are ubiquitous in real-world applications because they can represent various relationships between different types of entities. Learning embeddings in such graphs is therefore a critical problem in graph machine learning. However, existing solutions fail to scale to large heterogeneous graphs because of their high computational complexity. To address this issue, we propose HeteroMILE, a Multi-Level Embedding framework for nodes on a heterogeneous graph: a generic methodology that allows contemporary graph embedding methods to scale to large graphs. Before embedding, HeteroMILE repeatedly coarsens the large graph into a smaller one while preserving its backbone structure, which reduces the computational cost by avoiding time-consuming processing operations. It then refines the embedding of the coarsened graph back to the original graph using a heterogeneous graph convolutional neural network. We evaluate our approach on several popular heterogeneous graph datasets. The experimental results show that HeteroMILE can substantially reduce computational time (approximately 20x speedup) and generate embeddings of better quality for link prediction and node classification.
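The coarsen, embed, refine flow described above can be sketched from the refinement side. The Python below is a hedged illustration, not the paper's code: it assumes each coarsening level is recorded as a node-to-supernode mapping and that refinement begins by copying each supernode's vector to the nodes it absorbed. The names `project_embeddings` and `refine_through_levels` are hypothetical, and the learned heterogeneous GCN step is deliberately left out.

```python
def project_embeddings(coarse_emb, mapping):
    """Copy each supernode's vector to every finer-level node it absorbed."""
    return {node: list(coarse_emb[sid]) for node, sid in mapping.items()}

def refine_through_levels(coarsest_emb, mappings):
    """Walk from the coarsest level back to the original graph.

    mappings[i] maps nodes at level i to supernodes at level i + 1.
    """
    emb = coarsest_emb
    for mapping in reversed(mappings):
        emb = project_embeddings(emb, mapping)
        # A learned refinement step (the paper uses a heterogeneous GCN)
        # would update `emb` here; this sketch only performs the projection.
    return emb

# Two coarsening levels over three original nodes (toy data).
mappings = [
    {"a1": "A", "a2": "A", "a3": "B"},  # level 0 -> level 1
    {"A": "S", "B": "S"},               # level 1 -> level 2
]
coarsest = {"S": [0.1, 0.9]}            # embedding from the base method
original = refine_through_levels(coarsest, mappings)
```

The base embedding method runs only on the coarsest graph, which is where the time savings come from. Projection alone gives merged nodes identical vectors, which is why a learned refinement model is needed to recover per-node detail.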

Paper Structure (24 sections, 6 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
Background and Motivation
Heterogeneous Graph Embedding
Scalable Graph Embedding
Problem Statement
Methodology
Graph Coarsening
Jaccard Similarity Matching Strategy
Locality Sensitive Hashing (LSH) Matching Strategy
Choice for Coarsening Level
Base Embedding
Refinement
Heterogeneous Graph Convolutional Network for Refinement Learning
Loss Function
Experiment
...and 9 more sections

Figures (8)

Figure 1: Overview of HeteroMILE framework
Figure 2: Example of matching and merging the nodes
Figure 3: Refinement Process of HeteroMILE
Figure 4: Performance of HeteroMILE with metapath2vec as the base embedding method as the number of coarsening levels increases (levels are distinguished by color). The first row shows node classification results (Micro-F1), the second row shows link prediction results (AUROC), and the third row shows running time on a logarithmic scale. The running time lines of Jacc_WRS and Jacc_max overlap, as do those of LSH (k=128) and LSH (k=256). "level = 0" represents the original embedding method without HeteroMILE.
Figure 5: Performance of HeteroMILE with GATNE as the base embedding method as the number of coarsening levels increases (levels are distinguished by color). The first row shows node classification results (Micro-F1), the second row shows link prediction results (AUROC), and the third row shows running time on a logarithmic scale. The running time lines of Jacc_WRS and Jacc_max overlap, as do those of LSH (k=128) and LSH (k=256). "level = 0" represents the original embedding method without HeteroMILE.
...and 3 more figures
