Fast and Space-Efficient Parallel Algorithms for Influence Maximization

Letong Wang, Xiangyun Ding, Yan Gu, Yihan Sun

TL;DR

This work advances Influence Maximization by introducing sketch compression for the Independent Cascade model on undirected graphs and two parallel data structures (P-tree and Win-Tree) to accelerate seed selection. The PaC-IM framework unifies sketch-based efficiency with parallelism, achieving state-of-the-art speed and space and scaling to graphs with billions of edges. The key contributions are a controllable space-time tradeoff via center-based sketch compression, and practical, parallel CELF enhancements that maintain solution quality while delivering strong scalability. The results demonstrate substantial speedups over baselines and the ability to handle previously intractable, large-scale graphs, highlighting the framework's impact for large-scale diffusion analysis and marketing scenarios.

Abstract

Influence Maximization (IM) is a crucial problem in data science. The goal is to find a fixed-size set of highly influential seed vertices on a network that maximizes the influence spread along the edges. While IM is NP-hard on commonly used diffusion models, a greedy algorithm achieves a $(1-1/e)$-approximation by repeatedly selecting the vertex with the highest marginal gain in influence as the next seed. Because of this theoretical guarantee, a rich literature focuses on improving the performance of the greedy algorithm. To estimate the marginal gain, existing work either runs Monte Carlo (MC) simulations of influence spread or pre-stores hundreds of sketches (usually per-vertex information). However, these approaches can be inefficient in time (MC simulation) or space (storing sketches), which prevents them from scaling to today's large-scale graphs. This paper significantly improves the scalability of IM using two key techniques. The first is a sketch-compression technique for the independent cascade (IC) model on undirected graphs. It combines the simulation and sketching approaches to achieve a time-space tradeoff. The second technique consists of new data structures for parallel seed selection. Using these approaches, we implemented PaC-IM: Parallel and Compressed IM. We compare PaC-IM with state-of-the-art parallel IM systems on a 96-core machine with 1.5TB memory. PaC-IM can process large-scale graphs with up to 900M vertices and 74B edges in about 2 hours. On average across all tested graphs, our uncompressed version is 5--18$\times$ faster and about 1.4$\times$ more space-efficient than existing parallel IM systems. Using compression further saves 3.8$\times$ space with only 70% overhead in time on average.
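To make the greedy procedure and the role of sketches concrete, the following is a minimal sequential sketch in Python. It assumes the IC model on an undirected graph with a uniform edge probability `p`, so that each sampled graph can be summarized by per-vertex connected-component labels. The function names (`sample_components`, `greedy_im`) and parameters are illustrative assumptions, not PaC-IM's actual interface, and the code includes neither compression nor parallelism.

```python
import random

def sample_components(n, edges, p, rng):
    """Keep each undirected edge with probability p and return
    (label, size): a component label per vertex and the size of each component."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if rng.random() < p:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    label = [find(v) for v in range(n)]
    size = {}
    for r in label:
        size[r] = size.get(r, 0) + 1
    return label, size

def greedy_im(n, edges, p, k, R, seed=0):
    """Pick k seeds greedily, estimating marginal gains from R sketches."""
    rng = random.Random(seed)
    sketches = [sample_components(n, edges, p, rng) for _ in range(R)]
    covered = [set() for _ in range(R)]  # components already reached, per sketch
    seeds = []

    def gain(v):
        # Marginal gain: average size of v's component over the sketches
        # in which that component is not yet reached by a chosen seed.
        total = 0
        for (label, size), cov in zip(sketches, covered):
            if label[v] not in cov:
                total += size[label[v]]
        return total / R

    for _ in range(k):
        best = max((v for v in range(n) if v not in seeds), key=gain)
        seeds.append(best)
        for (label, _), cov in zip(sketches, covered):
            cov.add(label[best])
    return seeds
```

For example, with `p = 1.0` every sampled graph equals the input graph, so on the edge list `[(0, 1), (1, 2), (3, 4)]` with 6 vertices and `k = 2` the sketch picks one vertex from the 3-vertex component and then one from the 2-vertex component.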

Paper Structure (14 sections, 5 theorems, 2 equations, 9 figures, 7 tables, 5 algorithms)

This paper contains 14 sections, 5 theorems, 2 equations, 9 figures, 7 tables, 5 algorithms.

Key Result

Theorem 3.1

PaC-IM with parameter $\alpha$ requires $O((1+\alpha R)n)$ space to maintain $R$ sketches, and visits $O(R\cdot \min(1/\alpha,T))$ vertices to re-evaluate the marginal gain of one vertex $v$, where $T$ is the average connected-component (CC) size of $v$ over all sketches.
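The tradeoff in Theorem 3.1 can be illustrated by plugging numbers into the two bounds. The helper names and the values of $n$, $R$, and $T$ below are hypothetical and chosen only for illustration, and the functions evaluate the expressions inside the $O(\cdot)$ with constant factors ignored.

```python
def sketch_space_units(n, R, alpha):
    """Expression inside the space bound O((1 + alpha * R) * n)."""
    return (1 + alpha * R) * n

def reeval_visits(R, alpha, T):
    """Expression inside the re-evaluation bound O(R * min(1 / alpha, T))."""
    return R * min(1 / alpha, T)

# Hypothetical setting: one million vertices, 256 sketches, average CC size 50.
n, R, T = 1_000_000, 256, 50
for alpha in (1.0, 0.1):
    print(alpha, sketch_space_units(n, R, alpha), reeval_visits(R, alpha, T))
```

Under these assumed values, moving from $\alpha = 1$ to $\alpha = 0.1$ lowers the space expression from $257n$ to about $26.6n$, while the per-vertex re-evaluation expression grows from $256$ to $2560$ visited vertices.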

Figures (9)

  • Figure 1: Heatmap of relative running time and space usage, normalized to Ours$_{1}$. Ours$_{1}$: PaC-IM with no compression. Ours$_{0.1}$: PaC-IM with $10\times$ sketch compression. InfuserMG [gokturk2020boosting] and Ripples [minutoli2019fast]: existing parallel IM systems. Lower/green is better. The graph information is in \ref{tab:graph_info}. The running times are in \ref{tab:baselines}. $*$: graphs with more than a billion edges.
  • Figure 2: An example of our sketch compression on a graph with 8 vertices (B and F as centers) and $R=2$ sampled graphs.
  • Figure 3: The number of evaluations by CELF in each round. A point $(x,y)$ means that CELF does $y$ re-evaluations in the $x$-th round.
  • Figure 4: Example of P-tree-based seed selection. The letters in the tree nodes represent vertices, and the numbers below them are their stale scores. The P-tree maintains the vertices in decreasing order of their (stale) scores. By prefix-doubling, we extract batches of 1, 2, and 4 vertices and evaluate each batch in parallel. After batch 3, the highest true score (13) is higher than the current best stale score in the tree (10), and the algorithm stops. We select the vertex with the highest true score and insert the rest back into the tree with their new scores. The P-tree may evaluate more vertices than CELF, but the extra work can be bounded (\ref{lemma:efficiency}).
  • Figure 5: Seed selection based on Win-Tree. Each leaf stores a vertex id, and each internal node stores the vertex in its subtree with the highest (stale) score. (a) An example of finding the maximum score. For illustration purposes, we assume the parallel threads work at the same speed and all tree nodes on the same level are processed in parallel (in reality the threads run asynchronously in fork-join parallelism). Therefore, the subtree at $G$ will see $\Delta^*=14$ updated by $D$, and this subtree will be skipped. (b) Updating the internal nodes with the new $\bar{\Delta}[\cdot]$ values. Finally, the root of the tree has the highest true score.
  • ...and 4 more figures
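The prefix-doubling selection described in the Figure 4 caption can be sketched sequentially as follows. This is an illustrative approximation only: a sorted Python list stands in for the P-tree, each batch is evaluated in a plain loop rather than in parallel, and the function name, arguments, and stopping test (`>=`) are assumptions rather than the paper's code. It relies on stale scores being upper bounds on true scores, as in CELF.

```python
def select_seed_prefix_doubling(stale, true_gain):
    """stale: dict mapping vertex -> stale score (an upper bound on its true score).
    true_gain: callable returning the fresh score of a vertex.
    Returns (best_vertex, updated_scores_without_best, number_of_evaluations)."""
    if not stale:
        raise ValueError("no candidate vertices")
    # Stand-in for the P-tree: vertices in decreasing order of stale score.
    order = sorted(stale, key=lambda v: -stale[v])
    fresh = {}
    best_v, best_score = None, float("-inf")
    pos, batch = 0, 1
    while pos < len(order):
        # A parallel implementation would evaluate this whole batch at once.
        for v in order[pos:pos + batch]:
            fresh[v] = true_gain(v)
            if fresh[v] > best_score:
                best_v, best_score = v, fresh[v]
        pos += batch
        batch *= 2  # batches of size 1, 2, 4, ...
        # Stop once the best fresh score matches or beats every remaining stale score.
        if pos >= len(order) or best_score >= stale[order[pos]]:
            break
    updated = dict(stale)
    updated.update(fresh)  # re-insert evaluated vertices with their new scores
    del updated[best_v]    # the selected seed leaves the candidate set
    return best_v, updated, len(fresh)
```

With hypothetical stale scores `A:20, B:18, C:15, D:14, E:12, F:11, G:10, H:10, I:9` and true scores in which `D` scores 13, the sketch evaluates batches of 1, 2, and 4 vertices, then stops because 13 is at least the best remaining stale score of 10, mirroring the situation in the caption.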

Theorems & Definitions (5)

  • Theorem 3.1
  • Theorem 4.1: P-tree Correctness
  • Theorem 4.2: P-tree Efficiency
  • Theorem 4.3: P-tree Cost Bound
  • Theorem 4.4: Win-Tree Correctness
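The Win-Tree selection described in the Figure 5 caption can also be sketched sequentially. The class below is a simplified illustration under stated assumptions: it uses an array-based tournament tree, a recursive left-to-right traversal instead of asynchronous fork-join threads, and a strict comparison when skipping subtrees. The names `WinTree`, `select`, and `true_gain` are hypothetical and are not taken from the paper's implementation.

```python
class WinTree:
    """Tournament tree: each internal node stores the vertex in its subtree
    with the highest (stale) score; -1 marks an empty slot."""

    def __init__(self, scores):
        self.score = list(scores)  # score[v] is the stale score of vertex v
        self.size = 1
        while self.size < len(scores):
            self.size *= 2
        self.win = [-1] * (2 * self.size)
        for v in range(len(scores)):
            self.win[self.size + v] = v
        for i in range(self.size - 1, 0, -1):
            self.win[i] = self._better(self.win[2 * i], self.win[2 * i + 1])

    def _better(self, a, b):
        if a < 0:
            return b
        if b < 0:
            return a
        return a if self.score[a] >= self.score[b] else b

    def _visit(self, i, true_gain):
        w = self.win[i]
        # Skip a subtree whose best stale score cannot beat the best true score so far.
        if w < 0 or self.score[w] < self.best:
            return
        if i >= self.size:  # leaf: re-evaluate the vertex
            self.score[w] = true_gain(w)
            self.evaluated += 1
            self.best = max(self.best, self.score[w])
            return
        self._visit(2 * i, true_gain)
        self._visit(2 * i + 1, true_gain)
        self.win[i] = self._better(self.win[2 * i], self.win[2 * i + 1])

    def select(self, true_gain):
        """Return the vertex with the highest true score and remove it from the tree."""
        if self.win[1] < 0:
            raise ValueError("no candidate vertices")
        self.best = float("-inf")  # plays the role of the shared best score
        self.evaluated = 0
        self._visit(1, true_gain)
        winner = self.win[1]
        self.win[self.size + winner] = -1  # remove the chosen seed
        i = (self.size + winner) // 2
        while i >= 1:
            self.win[i] = self._better(self.win[2 * i], self.win[2 * i + 1])
            i //= 2
        return winner
```

For instance, with hypothetical stale scores `[20, 18, 15, 12, 11, 10, 9, 8]` and true scores `[9, 14, 7, 10, 5, 5, 5, 5]`, the sketch re-evaluates only vertices 0, 1, and 2, skips the leaf for vertex 3 and the entire right half of the tree, and returns vertex 1.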