SketchNE: Embedding Billion-Scale Networks Accurately in One Hour

Yuyang Xie, Yuxiao Dong, Jiezhong Qiu, Wenjian Yu, Xu Feng, Jie Tang

TL;DR

SketchNE embeds billion-scale networks on CPU-only hardware by recasting NetMF as the factorization of an element-wise function of a low-rank product, $f^{\circ}(\boldsymbol{L}\boldsymbol{R})$, so that the dense matrix never has to be formed. It combines two components: a sparse-sign randomized single-pass SVD, which computes a low-rank factorization without constructing $f^{\circ}(\boldsymbol{L}\boldsymbol{R})$, and a fast randomized eigen-decomposition of a modified Laplacian, which approximates $\boldsymbol{L}$ and $\boldsymbol{R}$. Spectral propagation and GBBS-based memory optimizations complete the pipeline. Running time scales linearly in the number of edges and vertices, and memory is $O(m+nk)$, which allows Hyperlink2012 (3.5B vertices, 225B edges) to be embedded in about 1 hour on a single CPU-only machine. Across nine datasets, SketchNE outperforms state-of-the-art baselines on vertex classification and link prediction while using less memory and less wall-clock time on billion-scale graphs.
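As a rough illustration of the randomized eigen-decomposition step, the Python sketch below applies a generic randomized subspace iteration to a degree-normalized adjacency matrix. It is not the paper's algorithm: the function names `normalized_adjacency` and `randomized_eigs`, the normalization $\boldsymbol{D}^{-\alpha}\boldsymbol{A}\boldsymbol{D}^{-\alpha}$ with default `alpha=0.5`, the Gaussian test matrix, and the `oversample` and `power_iters` parameters are all assumptions chosen for illustration.

```python
import numpy as np
import scipy.sparse as sp


def normalized_adjacency(A, alpha=0.5):
    """Return D^{-alpha} A D^{-alpha} for a sparse symmetric adjacency matrix A.

    Assumption: every vertex has degree > 0. The normalization itself is an
    illustrative stand-in, not the paper's exact modified Laplacian.
    """
    deg = np.asarray(A.sum(axis=1)).ravel()
    D = sp.diags(deg ** -alpha)
    return (D @ A @ D).tocsr()


def randomized_eigs(M, k, oversample=10, power_iters=4, seed=0):
    """Approximate the k largest-magnitude eigenpairs of a symmetric matrix M."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    # Random starting subspace, slightly larger than the target rank.
    Q = np.linalg.qr(rng.standard_normal((n, k + oversample)))[0]
    # Subspace (power) iteration; re-orthonormalize each step for stability.
    for _ in range(power_iters):
        Q = np.linalg.qr(M @ Q)[0]
    # Rayleigh-Ritz: solve the small projected eigenproblem.
    B = Q.T @ (M @ Q)
    B = (B + B.T) / 2  # remove round-off asymmetry
    w, V = np.linalg.eigh(B)
    order = np.argsort(-np.abs(w))[:k]
    return w[order], Q @ V[:, order]
```

The only operations that touch the graph are sparse matrix times thin dense matrix products, so the cost per iteration grows with the number of edges times the sketch width rather than with $n^2$.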

Abstract

We study large-scale network embedding with the goal of generating high-quality embeddings for networks with more than 1 billion vertices and 100 billion edges. Two recent attempts, LightNE and NetSMF, propose to sparsify and factorize the (dense) NetMF matrix for embedding large networks, where NetMF is a theoretically grounded network embedding method. However, their expensive memory requirements force a trade-off between embedding quality and scalability, which makes the embeddings less effective under real-world memory constraints. Therefore, we present the SketchNE model, a scalable, effective, and memory-efficient network embedding solution developed for a single CPU-only machine. The main idea of SketchNE is to avoid the explicit construction and factorization of the NetMF matrix, whether sparse or dense, when producing the embeddings; this is achieved through the proposed sparse-sign randomized single-pass SVD algorithm. We conduct extensive experiments on nine datasets of various sizes for vertex classification and link prediction, demonstrating that SketchNE consistently outperforms state-of-the-art baselines in both effectiveness and efficiency. SketchNE takes only 1.0 hour to embed the Hyperlink2012 network, with 3.5 billion vertices and 225 billion edges, on a single CPU-only machine while producing superior embeddings (e.g., a 282% relative HITS@10 gain over LightNE).
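To make the main idea concrete, here is a minimal Python sketch of a single-pass randomized SVD of $f^{\circ}(\boldsymbol{L}\boldsymbol{R})$ that only ever materializes one block of rows at a time. It uses a generic two-sided sketching scheme with sparse sign test matrices and should not be read as the paper's exact algorithm. The names `sparse_sign`, `trunc_log`, and `single_pass_svd`, the block size, the sketch sizes, the number of nonzeros per column, and the definition of the truncated logarithm as $\log(\max(x, 1))$ are all assumptions made for this illustration; `L` and `R` are small dense stand-ins.

```python
import numpy as np
import scipy.sparse as sp


def sparse_sign(n_rows, n_cols, nnz_per_col, rng):
    """Sparse sign matrix: each column holds nnz_per_col entries equal to +1 or -1."""
    rows = np.concatenate([
        rng.choice(n_rows, size=nnz_per_col, replace=False) for _ in range(n_cols)
    ])
    cols = np.repeat(np.arange(n_cols), nnz_per_col)
    vals = rng.choice([-1.0, 1.0], size=rows.size)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))


def trunc_log(x):
    """Element-wise truncated logarithm, assumed here to be log(max(x, 1))."""
    return np.log(np.maximum(x, 1.0))


def single_pass_svd(L, R, rank, oversample=10, block=64, seed=0, f=trunc_log):
    """Approximate a rank-`rank` SVD of f(L @ R) without storing f(L @ R)."""
    rng = np.random.default_rng(seed)
    n, m = L.shape[0], R.shape[1]
    k = rank + oversample                                 # range sketch width
    l = 2 * k                                             # co-range sketch height
    Omega = sparse_sign(m, k, min(8, m), rng)             # m x k
    Psi = sparse_sign(n, l, min(8, n), rng).T.tocsc()     # l x n
    Y = np.empty((n, k))                                  # will hold A @ Omega
    W = np.zeros((l, m))                                  # will hold Psi @ A
    # Single pass: each row block of A = f(L @ R) is formed once, used, discarded.
    for start in range(0, n, block):
        stop = min(start + block, n)
        A_blk = f(L[start:stop] @ R)
        Y[start:stop] = (Omega.T @ A_blk.T).T
        W += Psi[:, start:stop] @ A_blk
    Q, _ = np.linalg.qr(Y)                                # basis for range of A
    X, *_ = np.linalg.lstsq(Psi @ Q, W, rcond=None)       # X approximates Q.T @ A
    U_small, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]
```

In this toy version the co-range sketch `W` is still a dense `l x m` array, which is acceptable for a small example but is one of the places where a billion-scale implementation would need more care.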

Paper Structure

This paper contains 14 sections, 1 theorem, 12 equations, 7 figures, 6 tables, and 6 algorithms.

Key Result

Theorem 1

Suppose $f^{\circ}$ denotes $\mathrm{trunc\_log}^{\circ}$, i.e. the element-wise truncated logarithm, $f^{\circ}(\boldsymbol{M})$ is the matrix in (1), and $f^{\circ}(\boldsymbol{L^{\prime}R^{\prime}})$ is defined by (11) which includes the quantities obtained with Alg. 5. Then, with high probability. Here $|\lambda_j|$ is the $j$-th largest absolute value of eigenvalue of $\boldsymbol{D}^{-\alph

Figures (7)

  • Figure 1: The overview of SketchNE vs. NetMF and NetSMF/LightNE. The symbols used are listed in Table \ref{tab:notation}.
  • Figure 2: Vertex classification performance (Micro-F1 and Macro-F1) w.r.t. the ratio of training data. Results for methods that cannot handle the computation or cannot finish within one day are unavailable and therefore not plotted.
  • Figure 3: The validation of the effectiveness of freigs.
  • Figure 4: The embedding performance comparison.
  • Figure 5: The trade-offs between efficiency and performance.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof