OMEGA: A Low-Latency GNN Serving System for Large Graphs

Geon-Woo Kim, Donghyun Kim, Jeongyoon Moon, Henry Liu, Tarannum Khan, Anand Iyer, Daehyeok Kim, Aditya Akella

TL;DR

OMEGA targets low-latency GNN serving on billion-node graphs by combining Selective Recomputation of Precomputed Embeddings (SRPE) with Computation Graph Parallelism (CGP). SRPE reuses precomputed embeddings for most neighbors and selectively recomputes a small, error-prone subset to limit accuracy loss, with the subset chosen by a probabilistic policy. CGP distributes both the creation and the execution of computation graphs across multiple machines, using local aggregations, all-to-all communication, and custom merge functions to minimize cross-machine data transfer. Together, SRPE and CGP reduce latency by up to $159\times$ vs. full-graph baselines and up to $10.8\times$ vs. sampling-based baselines with minimal accuracy loss, making GNN serving on large graphs practical.
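
To make the SRPE idea concrete, here is a minimal sketch on a toy graph. It is not OMEGA's implementation: it assumes a 2-layer GNN with plain mean aggregation and no learned weights, and the names `layer1`, `serve_query`, `adj`, `feat`, and `precomputed` are hypothetical. Query node 8 attaches to existing nodes 2 and 3; the layer-1 embeddings of those neighbors were precomputed before the query arrived, so they are stale unless recomputed.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def layer1(node, adj, feat, extra=None):
    """Layer-1 embedding of `node`: mean of its neighbors' features.
    `extra` optionally adds a newly arrived query node as a neighbor."""
    nbrs = list(adj[node]) + ([extra] if extra is not None else [])
    return mean([feat[n] for n in nbrs])

def serve_query(q, q_nbrs, adj, feat, precomputed, recompute):
    """Layer-2 embedding of query node q (toy model, an assumption).
    Neighbors in `recompute` get fresh layer-1 embeddings that account
    for the new edge to q; all others reuse the precomputed value."""
    h1 = []
    for v in q_nbrs:
        if v in recompute:
            h1.append(layer1(v, adj, feat, extra=q))
        else:
            h1.append(precomputed[v])
    return mean(h1)

# Toy data (hypothetical): a path graph 0-1-2-3 with 1-D features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
feat = {0: [0.0], 1: [2.0], 2: [4.0], 3: [6.0], 8: [10.0]}

# Embeddings computed offline, before query node 8 exists in the graph.
precomputed = {v: layer1(v, adj, feat) for v in adj}

exact = serve_query(8, [2, 3], adj, feat, precomputed, recompute={2, 3})
reuse_all = serve_query(8, [2, 3], adj, feat, precomputed, recompute=set())
selective = serve_query(8, [2, 3], adj, feat, precomputed, recompute={3})
```

In this toy case, reusing both precomputed embeddings gives 4.0 against an exact value of 6.5, while recomputing only node 3 (the lower-degree neighbor, whose embedding shifts most when the query edge is added) gives 5.5. Choosing which neighbors to recompute is what the probabilistic policy decides.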

Abstract

Graph Neural Networks (GNNs) have been widely adopted for their ability to compute expressive node representations in graph datasets. However, serving GNNs on large graphs is challenging due to the high communication, computation, and memory overheads of constructing and executing computation graphs, which represent information flow across large neighborhoods. Existing approximation techniques in training can mitigate the overheads but, in serving, still lead to high latency and/or accuracy loss. To this end, we propose OMEGA, a system that enables low-latency GNN serving for large graphs with minimal accuracy loss through two key ideas. First, OMEGA employs selective recomputation of precomputed embeddings, which allows for reusing precomputed computation subgraphs while selectively recomputing a small fraction to minimize accuracy loss. Second, we develop computation graph parallelism, which reduces communication overhead by parallelizing the creation and execution of computation graphs across machines. Our evaluation with large graph datasets and GNN models shows that OMEGA significantly outperforms state-of-the-art techniques.
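
As an illustration of how computation graph parallelism can cut communication, the sketch below shows one possible merge function for mean aggregation. It is an assumption for illustration, not OMEGA's code: each machine reduces the neighbor features it stores locally to a single `(sum, count)` pair, and only those pairs cross the network. The names `local_aggregate`, `merge_mean`, and `partitions` are hypothetical.

```python
def local_aggregate(features):
    """Partial aggregation on one machine over a non-empty list of
    locally stored neighbor features: element-wise sum plus a count."""
    total = [0.0] * len(features[0])
    for f in features:
        for i, x in enumerate(f):
            total[i] += x
    return total, len(features)

def merge_mean(partials):
    """Merge function for mean aggregation: combine the (sum, count)
    pairs received from all machines into the global mean."""
    total = [0.0] * len(partials[0][0])
    count = 0
    for partial_sum, partial_count in partials:
        for i, x in enumerate(partial_sum):
            total[i] += x
        count += partial_count
    return [x / count for x in total]

# Hypothetical placement: a query node's three neighbors live on two machines.
partitions = {
    "machine0": [[1.0, 2.0], [3.0, 4.0]],
    "machine1": [[5.0, 6.0]],
}

# Each machine sends one (sum, count) pair instead of every raw feature vector.
partials = [local_aggregate(feats) for feats in partitions.values()]
merged = merge_mean(partials)

# Reference: mean computed after gathering all features on one machine.
all_feats = [f for feats in partitions.values() for f in feats]
centralized = [sum(col) / len(all_feats) for col in zip(*all_feats)]
```

The merged result matches the centralized mean, but the data sent per machine is one vector and one integer regardless of how many neighbors that machine holds. Aggregations that are not simple sums would need their own merge function.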

Paper Structure (28 sections, 1 theorem, 7 equations, 18 figures, 6 tables)

Key Result

Theorem 1

The sum of the variances of every dimension of the estimators ($\sum_u \sum_{l=1}^{k-1} \hat{f}_u^{(l)}$) is minimized when $p_u \propto ||\sum_{l=1}^{k-1} q_u^{(l)}|| = ||\sum_{l=1}^{k-1}\sum_{v \in \mathit{N_Q}(u)} \frac{m_v^{(l)}}{|\mathit{N}(u)|}||$, given $\gamma = \sum_{u} p_u$.
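
The proportionality in Theorem 1 can be turned into concrete probabilities as follows. This is a sketch under stated assumptions rather than the paper's procedure: it reads $N_Q(u)$ as the query nodes adjacent to $u$, $m_v^{(l)}$ as the layer-$l$ message of query node $v$, and $\gamma$ as the expected number of recomputed nodes. The function names and the toy inputs are hypothetical.

```python
import math

def importance_scores(messages, query_neighbors, degree):
    """Score of node u: the norm of sum_l sum_{v in N_Q(u)} m_v^(l) / |N(u)|.
    messages[v] holds one vector per layer l = 1..k-1 for query node v."""
    scores = {}
    for u, q_nodes in query_neighbors.items():
        total = None
        for v in q_nodes:
            for m in messages[v]:
                if total is None:
                    total = [0.0] * len(m)
                for i, x in enumerate(m):
                    total[i] += x / degree[u]
        scores[u] = 0.0 if total is None else math.sqrt(sum(x * x for x in total))
    return scores

def recompute_probabilities(scores, gamma):
    """Scale the scores so that p_u is proportional to score_u and
    sum_u p_u = gamma. Values above 1 are not capped here (an omission
    a real policy would have to handle)."""
    norm = sum(scores.values())
    if norm == 0:
        return {u: 0.0 for u in scores}
    return {u: gamma * s / norm for u, s in scores.items()}

# Toy inputs (hypothetical): k = 2, so there is a single message layer.
messages = {8: [[3.0, 4.0]]}           # query node 8
query_neighbors = {2: [8], 3: [8]}     # N_Q(2) = N_Q(3) = {8}
degree = {2: 2, 3: 4}                  # |N(2)|, |N(3)|

scores = importance_scores(messages, query_neighbors, degree)
probs = recompute_probabilities(scores, gamma=1.5)
```

With these inputs the scores are 2.5 for node 2 and 1.25 for node 3, so node 2 gets twice the recomputation probability of node 3. The $1/|N(u)|$ factor is what favors low-degree nodes: a single new query edge changes their aggregated embedding the most.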

Figures (18)

  • Figure 1: (Left) An example graph dataset with 8 nodes and F-dimensional feature vectors. We use this graph dataset as our running example. (Right) The 2-hop computation graph for node 0 where the boxes below represent feature vectors.
  • Figure 2: Distributed GNN serving end-to-end workflow. To generate the embeddings of (batched) query nodes, the master forwards a serving request to one machine. A computation graph builder then creates $k$-hop computation graphs for the query nodes by loading from the local partition and fetching required edges and feature vectors from remote partitions. The red line represents remote communication. The computation graphs are then executed by a GNN executor after being copied into GPU device memory. Finally, the embeddings of the query nodes are returned.
  • Figure 3: (Left) Latency breakdown of \ref{tab:acc-lat-tradeoff}. The data size in each bar indicates the amount of feature vectors and edges fetched. (Right) Trendlines of peak FP32 FLOPs of NVIDIA GPUs (P100, V100, A100, H100) and bandwidth of NVIDIA NICs (ConnectX-4, ConnectX-5, ConnectX-6).
  • Figure 4: High-level illustration of (Left) Selective Recomputation of Precomputed Embeddings (SRPE) and (Right) Computation Graph Parallelism (CGP). The query node 8 is connected to the existing nodes 2 and 3 of the example graph dataset in Figure 1.
  • Figure 5: OMEGA end-to-end workflow.
  • ...and 13 more figures

Theorems & Definitions (1)

  • Theorem 1