Parallel $k$-Core Decomposition: Theory and Practice

Youzhe Liu, Xiaojun Dong, Yan Gu, Yihan Sun

TL;DR

The paper addresses the challenge of fast, work-efficient parallel $k$-core decomposition on large graphs. It proposes a simple frontier-based framework that achieves $O(n+m)$ work and enhances parallelism through two key techniques: a sampling scheme to reduce contention on high-degree vertices and Vertical Granularity Control (VGC) to hide scheduling overhead, complemented by a Hierarchical Bucketing Structure (HBS) to optimize frontier management. The combined approach yields state-of-the-art performance, with speedups up to $315\times$ over ParK, $33.4\times$ over PKC, and $52.5\times$ over Julienne on 25 graphs, and strong scalability on a 96-core machine across dense and sparse graphs. The work demonstrates that work-efficiency and high parallelism can be achieved together in practical implementations, providing reusable techniques for parallel graph peeling and related problems. These advances enable faster exact $k$-core decompositions in real-world analytics and graph mining tasks.

Abstract

This paper proposes efficient solutions for $k$-core decomposition with high parallelism. The problem of $k$-core decomposition is fundamental in graph analysis and has applications across various domains. However, existing algorithms face significant challenges in achieving work-efficiency in theory and/or high parallelism in practice, and suffer from various performance bottlenecks. We present a simple, work-efficient parallel framework for $k$-core decomposition that is easy to implement and adaptable to various strategies for improving work-efficiency. We introduce two techniques to enhance parallelism: a sampling scheme to reduce contention on high-degree vertices, and vertical granularity control (VGC) to mitigate scheduling overhead for low-degree vertices. Furthermore, we design a hierarchical bucket structure to optimize performance for graphs with high coreness values. We evaluate our algorithm on a diverse set of real-world and synthetic graphs. Compared to state-of-the-art parallel algorithms, including ParK, PKC, and Julienne, our approach demonstrates superior performance on 23 out of 25 graphs when tested on a 96-core machine. Our algorithm shows speedups of up to 315$\times$ over ParK, 33.4$\times$ over PKC, and 52.5$\times$ over Julienne.
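As a point of reference for the peeling process the abstract refers to, the following is a minimal sequential sketch of round-based peeling. It is not the paper's parallel algorithm: it uses no sampling, VGC, or bucketing structure, and because it rescans all vertices at the start of each round it is not work-efficient. The function name `coreness` and the edge-list input format are assumptions made for illustration.

```python
def coreness(n, edges):
    """Return the coreness of each vertex 0..n-1 of an undirected simple graph.

    Illustrative sequential sketch only; the names and input format are
    assumptions, not the paper's interface.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    deg = [len(neighbors) for neighbors in adj]  # induced degree of each vertex
    core = [0] * n
    peeled = [False] * n
    remaining = n
    k = 0

    while remaining > 0:
        # Initial frontier of round k: unpeeled vertices with induced degree <= k.
        # (This full rescan is the non-work-efficient part of the sketch.)
        frontier = [v for v in range(n) if not peeled[v] and deg[v] <= k]
        for v in frontier:
            peeled[v] = True

        # Subrounds: peeling the frontier may push more vertices down to degree k.
        while frontier:
            next_frontier = []
            for v in frontier:
                core[v] = k
                remaining -= 1
                for u in adj[v]:
                    if not peeled[u]:
                        deg[u] -= 1
                        if deg[u] <= k:
                            peeled[u] = True
                            next_frontier.append(u)
            frontier = next_frontier
        k += 1

    return core


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
example = coreness(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

In this example the pendant vertex is peeled in round $k=1$ and the triangle in round $k=2$. In a parallel version, each inner frontier would be processed concurrently, which is where contention on high-degree vertices and scheduling overhead for small frontiers arise.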

Paper Structure

This paper contains 31 sections, 4 theorems, 13 figures, 3 tables, 5 algorithms.

Key Result

Theorem 3.1

Assuming $O(|\mathcal{F}|+\sum_{v\in \mathcal{F}} d(v))$ work for the $\textsc{Peel}(\mathcal{F},\cdot)$ function on the line labeled process_bucket, and $O(|\mathcal{A}|)$ work for the functions on the lines labeled extract and pack, where $\mathcal{A}$ is the input

Figures (13)

  • Figure 1: An example of $k$-core decomposition with $k_{\max} = 3$. Vertices and edges peeled in each subround are marked in red.
  • Figure 2: Speedup of ParK [dasari2014park], PKC [kabir2017parallel], Julienne [dhulipala2017gbbs2021], and our algorithm over the best sequential time (our sequential time or the BZ algorithm time [batagelj2003m]) on 14 representative graphs. Higher is better. Full results are in the full results table (table:fulltable). Numbers below 2 are given on the bars, meaning the parallel code is no more than $2\times$ faster than a sequential one.
  • Figure 3: The peeling process on a grid with and without using VGC. In this example, the queue size is 4. Note that the execution of VGC is not deterministic, and (b) shows one possible execution.
  • Figure 4: The execution of the hierarchical bucketing structure for the first 10 rounds. The number in each box indicates the key (induced degree) range of the associated bucket. The first row shows that a vertex with degree $d$ is initially inserted into bucket $\lceil\log_2 (d+1)\rceil$. When $k=0$ or $1$, vertices with degree 0 or 1 are directly extracted from buckets 0 and 1. We then redistribute vertices in bucket 2 (with degrees 2 or 3) to buckets 0 and 1 (shown in the second row), such that they can be directly identified when $k=2$ or $3$. Similarly, after that, we redistribute vertices in bucket 3 (with degrees 4 to 7) to the first three buckets: vertices with degrees 4 and 5 are moved to buckets 0 and 1, respectively, and vertices with degrees 6 and 7 are moved to bucket 2, and so forth.
  • Figure 5: Relative running time of ParK [dasari2014park], PKC [kabir2017parallel], and Julienne [dhulipala2017gbbs2021], normalized to our running time (red dotted line) on all graphs. Lower is better. The bars are truncated at 4 for better visualization. The numbers on the bars are the actual relative running times.
  • ...and 8 more figures
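
The bucket assignment described in the Figure 4 caption can be sketched as follows. This is a hedged illustration read off the caption alone, not the paper's implementation: the function names `initial_bucket` and `bucket_after_redistribution`, and the `base` parameter (the smallest degree in the bucket being redistributed), are assumptions. The redistribution rule reproduces the two cases the caption spells out (bucket 2 and bucket 3); applying it to larger buckets is an extrapolation.

```python
def initial_bucket(d: int) -> int:
    """Initial bucket of a vertex with degree d: ceil(log2(d + 1)).

    For non-negative integers this equals d.bit_length(), which avoids
    floating-point rounding in log2.
    """
    if d < 0:
        raise ValueError("degree must be non-negative")
    return d.bit_length()


def bucket_after_redistribution(d: int, base: int) -> int:
    """Bucket of a vertex with induced degree d after redistribution.

    `base` is the smallest degree of the bucket being redistributed
    (hypothetical parameter). The same logarithmic rule is applied to
    the offset d - base, which matches the cases given in the caption.
    """
    if d < base:
        raise ValueError("degree must be at least base")
    return (d - base).bit_length()


# Initial placement for degrees 0..8.
initial = [initial_bucket(d) for d in range(9)]

# Redistributing bucket 3 (degrees 4 to 7, so base = 4).
redistributed = [bucket_after_redistribution(d, 4) for d in range(4, 8)]
```

With this rule, bucket $i \ge 1$ initially holds degrees $2^{i-1}$ to $2^i - 1$, so the number of buckets grows logarithmically with the maximum degree rather than linearly.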

Theorems & Definitions (4)

  • Theorem 3.1
  • Lemma 4.1
  • Theorem 4.2
  • Corollary 4.3