Distributed Graph Embedding with Information-Oriented Random Walks

Peng Fang; Arijit Khan; Siqiang Luo; Fang Wang; Dan Feng; Zhenli Li; Wei Yin; Yuchao Cao

TL;DR

DistGER is a general-purpose, distributed, information-centric, random-walk-based graph embedding framework that scales to billion-edge graphs. It also improves the distributed Skip-Gram learning model used to generate node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency.

Abstract

Graph embedding maps graph nodes to low-dimensional vectors and is widely adopted in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, for tasks such as link prediction on the Twitter graph, which has over one billion edges. Most existing graph embedding methods fall short of high data scalability. In this paper, we present a general-purpose, distributed, information-centric, random-walk-based graph embedding framework, DistGER, which can scale to embed billion-edge graphs. DistGER incrementally computes information-centric random walks. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy that simultaneously achieves high local partition quality and excellent workload balancing across machines. DistGER also improves the distributed Skip-Gram learning model to generate node embeddings by optimizing access locality, CPU throughput, and synchronization efficiency. Experiments on real-world graphs demonstrate that, compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, and PyTorch-BigGraph, DistGER exhibits 2.33x to 129x acceleration, a 45% reduction in cross-machine communication, and more than 10% effectiveness improvement in downstream tasks.

Paper Structure (26 sections, 1 theorem, 11 equations, 13 figures, 10 tables, 1 algorithm)

Introduction
Preliminaries and Baseline
Random-walks Based Graph Embedding
Distributed Random Walks on Graphs
Baseline: HuGE-D
The Proposed System: DistGER
Incremental Information-centric Computing
Multi-Proximity-aware Streaming Partitioning
Distributed Embedding Learning
Challenges and Overview of Our Solution
Proposed Improvements: DSGL
Putting Everything Together
Related Work
Experimental Results
Experimental Setup
...and 11 more sections

Key Result

Theorem 1

Consider an ongoing walk $W^L$ with current length $L\geq 0$. Let $v$ be the next accepted node to be added to $W^L$, and let $n(v)\geq 0$ be the number of occurrences of $v$ in the walk. When $v$ is added, both $L$ and $n(v)$ increase by 1. For clarity, we denote $n(v)$ in $W^L$ and $W^{L+1}$ as
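The theorem statement is cut off here, so the following is only an illustrative sketch of the bookkeeping it sets up (the walk length $L$ and the count $n(v)$ each growing by 1 when a node is accepted). It assumes the information measure of a walk is the Shannon entropy of its node-occurrence frequencies, which this page does not confirm. The class name `IncrementalWalkEntropy` and the helper `entropy_from_scratch` are hypothetical and are not taken from DistGER. The sketch uses the identity $H = \log_2 L - \frac{1}{L}\sum_u n(u)\log_2 n(u)$, so each accepted node costs O(1) work instead of a full recount.

```python
import math
from collections import Counter


def entropy_from_scratch(walk):
    """Shannon entropy (bits) of node-occurrence frequencies; O(len(walk))."""
    length = len(walk)
    if length == 0:
        return 0.0
    counts = Counter(walk)
    return -sum((c / length) * math.log2(c / length) for c in counts.values())


class IncrementalWalkEntropy:
    """Hypothetical helper: keeps walk entropy up to date in O(1) per added node.

    Assumption: the walk's information measure is the entropy of n(u) / L.
    """

    def __init__(self):
        self.length = 0          # L, current walk length
        self.counts = Counter()  # n(u) for every node u seen so far
        self.s = 0.0             # running sum of n(u) * log2(n(u))

    def add(self, v):
        """Append node v: both L and n(v) increase by 1. Returns the new entropy."""
        n = self.counts[v]
        old_term = n * math.log2(n) if n > 0 else 0.0
        new_term = (n + 1) * math.log2(n + 1)
        self.s += new_term - old_term
        self.counts[v] = n + 1
        self.length += 1
        return self.entropy()

    def entropy(self):
        if self.length == 0:
            return 0.0
        # H = log2(L) - (1/L) * sum_u n(u) * log2(n(u))
        return math.log2(self.length) - self.s / self.length
```

Calling `add` once per accepted node and comparing the result against `entropy_from_scratch` on the same prefix is a simple way to check that the incremental update matches the full recomputation.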

Figures (13)

Figure 1: The workflow of our proposed system: DistGER
Figure 2: Incremental computing for information-centric random walk
Figure 3: Schematic diagram of (a) Skip-Gram with negative samples (SGNS), (b) Pword2vec, (c) pSGNScc, and (d) DSGL (our method)
Figure 4: Workflow of our DSGL: distributed Skip-Gram learning
Figure 5: Efficiency: PBG, DistDGL, KnightKing, HuGE-D (baseline), DistGER (ours)
...and 8 more figures

Theorems & Definitions (2)

Theorem 1
Example 1
