DiFuseR: A Distributed Sketch-based Influence Maximization Algorithm for GPUs

Gökhan Göktürk, Kamer Kaya

TL;DR

DiFuseR is a blazing-fast, high-quality IM algorithm that runs on multiple GPUs in a distributed setting. It is designed to increase GPU utilization, reduce inter-node communication, and minimize overlapping data and computation among the nodes.

Abstract

Influence Maximization (IM) aims to find a given number of "seed" vertices that maximize the expected spread under a given diffusion model. Because finding an optimal seed set is NP-hard, approximation algorithms are often used for IM. However, these algorithms require a large number of simulations to find good seed sets. In this work, we propose DiFuseR, a blazing-fast, high-quality IM algorithm that can run on multiple GPUs in a distributed setting. DiFuseR is designed to increase GPU utilization, reduce inter-node communication, and minimize overlapping data and computation among the nodes. In experiments with various graphs, including some of the largest networks available, and various diffusion settings, the proposed approach is found to be 3.2x and 12x faster on average on a single GPU and on 8 GPUs, respectively. It can achieve up to 8x and 233.7x speedup on the same hardware settings. Furthermore, thanks to its smart load-balancing mechanism, it is on average 5.6x faster on 8 GPUs than on a single GPU.

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 9 tables, 4 algorithms.

Figures (5)

  • Figure 1: (a) The directed graph $G = (V, E)$ for IC with independent diffusion probabilities. (b) The directed graph for WC is obtained by setting the diffusion probabilities of incoming edges to $1 / |\Gamma^-_G(v)|$ for each vertex $v \in V$.
  • Figure 2: (a) Two sampled subgraphs of a toy graph with 4 vertices and 6 edges. (b) The simulations performed are fused with sampling. Each edge is labeled with the corresponding sample/simulation IDs.
  • Figure 3: The reduction scheme of DiFuseR to find a seed vertex using sketch registers. The scheme uses 4 machines with 8 GPUs (2 GPUs each), and ${\cal J}=256$ registers. M[$j$] represents a sketch with $j$ registers. The symbol $+$ represents the element-wise reduction of registers.
  • Figure 4: An example run of a seed selection in DiFuseR. The example scheme uses 4 machines with 8 GPUs (2 GPUs each) and ${\cal J} = 256$ registers. M[$j$] represents a sketch with $j$ registers. The corresponding lines of Algorithm 4 are given to the right of the related steps.
  • Figure 5: An example run of fused sampling on edge $e$, using 4 samples and (toy) warps that are 2 threads wide. Whether the edge $e$ is sampled, i.e., $e \in \hat{G}$, is computed using the formula $X \oplus h(e) < W_e$. $P_e$ and $W_e$ use a fixed-point representation. (a) uses naive random values, where samples are evenly distributed among the warps, causing divergent paths in warps 0 and 1. (b) uses FASST, where either no edges are sampled in a warp or the warp is fully utilized without any divergence. Warp 0 has no samples, so an early exit is available. In warp 1, all samples are active, which allows full utilization of the warp.
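The captions of Figures 1 and 5 describe two small computations: the WC edge probabilities $1 / |\Gamma^-_G(v)|$ and the fused sampling test $X \oplus h(e) < W_e$. The following is a minimal CPU-side sketch of both, not the authors' implementation. The 32-bit word width, the hash function `edge_hash`, and all function names are assumptions made for illustration, and the FASST assignment of random words to warps is not modeled.

```python
import random
from collections import Counter

BITS = 32                 # assumed fixed-point width (not stated in the captions)
MASK = (1 << BITS) - 1


def wc_probabilities(edges):
    """WC setting: each edge (u, v) gets probability 1 / in-degree(v)."""
    indeg = Counter(v for _, v in edges)
    return {(u, v): 1.0 / indeg[v] for u, v in edges}


def edge_hash(u, v):
    """Hypothetical 32-bit edge hash h(e); a stand-in, not the paper's hash."""
    x = (u * 0x9E3779B1 + v * 0x85EBCA6B) & MASK
    x ^= x >> 16
    x = (x * 0x7FEB352D) & MASK
    x ^= x >> 15
    return x


def threshold(p):
    """Fixed-point threshold W_e for a diffusion probability p in [0, 1]."""
    return min(int(p * (1 << BITS)), 1 << BITS)


def is_sampled(x, u, v, p):
    """Fused sampling test: edge (u, v) is in the sample with word x
    iff (x XOR h(e)) < W_e. Nothing is stored per sample; the answer is
    recomputed from x and the edge whenever it is needed."""
    return (x ^ edge_hash(u, v)) < threshold(p)


def sample_words(n, seed=0):
    """One random word X per sample/simulation (plain uniform draws)."""
    rng = random.Random(seed)
    return [rng.getrandbits(BITS) for _ in range(n)]
```

Calling `is_sampled(x, u, v, p)` for every word returned by `sample_words(n)` yields the set of sample IDs that contain the edge, which is the kind of labeling shown in Figure 2. Because XOR with a fixed hash maps uniform words to uniform words, the fraction of samples containing an edge approaches its probability as the number of samples grows.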