Scaling Inter-procedural Dataflow Analysis on the Cloud

Zewen Sun, Yujin Zhang, Duanchen Xu, Yiyu Zhang, Yun Qi, Yueyang Wang, Yi Li, Zhaokang Wang, Yue Li, Xuandong Li, Zhiqiang Zuo, Qingda Lu, Wenwen Peng, Shengjian Guo

TL;DR

This work presents BigDataflow, a cloud-based framework for scalable interprocedural dataflow analysis. By reimagining the traditional worklist algorithm as distributed vertex-centric computation on Apache Giraph, it delivers two modes: whole-program and incremental analysis, supported by optimized data movement and correctness proofs. Empirical evaluation on large real-world software (Linux, Firefox, PostgreSQL, OpenSSL, Httpd) shows substantial speedups over single-machine approaches, with memory demands in the multi-terabyte range that are mitigated by the cloud hardware model. Incremental analysis further reduces time and cost by confining recomputation to affected subgraphs, aligning well with modern CI/CD workflows. Overall, BigDataflow demonstrates how large-scale, context-sensitive dataflow analyses can be made practical and responsive on distributed cloud resources.
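To make the idea of a worklist algorithm recast as vertex-centric computation concrete, the following is a minimal single-process sketch, not BigDataflow's implementation and not the Apache Giraph API. It assumes a forward may-analysis with set union as the merge operator and gen/kill transfer functions (reaching-definitions style). The function name `vertex_centric_dataflow` and the parameters `succs`, `gen`, `kill`, and `entry` are hypothetical. Each loop iteration plays the role of one superstep: only vertices that received messages are active, and a vertex sends its outgoing facts to its successors only when those facts changed.

```python
from collections import defaultdict

_NOT_COMPUTED = object()  # sentinel: vertex has never been active


def vertex_centric_dataflow(succs, gen, kill, entry):
    """Simulate superstep-based forward dataflow analysis on one machine.

    Assumptions (illustrative only): facts are sets, merge is union,
    and each vertex v has a transfer function out = gen[v] | (in - kill[v]).
    """
    vertices = set(succs) | {t for ts in succs.values() for t in ts}
    in_facts = {v: set() for v in vertices}       # accumulated incoming facts
    out_facts = {v: _NOT_COMPUTED for v in vertices}

    # Superstep 0: only the entry vertex is active, with an empty message.
    inbox = {entry: set()}
    supersteps = 0
    while inbox:                                   # stop when no messages remain
        outbox = defaultdict(set)
        for v, msg in inbox.items():               # each active vertex "computes"
            in_facts[v] |= msg                     # merge messages into stored facts
            new_out = gen.get(v, set()) | (in_facts[v] - kill.get(v, set()))
            if new_out != out_facts[v]:            # propagate only on change
                out_facts[v] = new_out
                for s in succs.get(v, ()):
                    outbox[s] |= new_out
        inbox = dict(outbox)                       # messages delivered next superstep
        supersteps += 1

    # Vertices never reached from the entry report an empty fact set.
    result = {v: (set() if o is _NOT_COMPUTED else o) for v, o in out_facts.items()}
    return result, supersteps


# Tiny CFG with a loop: 1 -> 2 -> 3 -> 2, and 2 -> 4.
# Vertex 1 defines x (d1); vertex 3 redefines x (d2) and kills d1.
example_succs = {1: [2], 2: [3, 4], 3: [2], 4: []}
example_gen = {1: {"d1"}, 3: {"d2"}}
example_kill = {1: {"d2"}, 3: {"d1"}}
facts, steps = vertex_centric_dataflow(example_succs, example_gen, example_kill, entry=1)
print(facts[2], steps)
```

In this sketch, storing the accumulated incoming facts at each vertex is safe only because the merge is a union and the facts grow monotonically; a different lattice or merge operator would need a different update rule.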

Abstract

Apart from forming the backbone of compiler optimization, static dataflow analysis has been widely applied in a variety of applications, such as bug detection, privacy analysis, and program comprehension. Despite its importance, performing interprocedural dataflow analysis on large-scale programs is well known to be challenging. In this paper, we propose a novel distributed analysis framework supporting general interprocedural dataflow analysis. Inspired by large-scale graph processing, we devise dedicated distributed worklist algorithms for both whole-program analysis and incremental analysis. We implement these algorithms and develop a distributed framework called BigDataflow running on a large-scale cluster. The experimental results validate the promising performance of BigDataflow: it can finish analyzing programs with millions of lines of code in minutes. Compared with the state of the art, BigDataflow achieves much higher analysis efficiency.
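The incremental mode can be illustrated with a small sequential sketch. It is an assumption-laden illustration, not the paper's distributed algorithm: it treats the affected subgraph as every vertex forward-reachable from a changed vertex, resets the facts there, and reruns an ordinary worklist loop while reusing cached facts from unaffected predecessors. The names `affected_vertices`, `incremental_update`, `cached_out`, and `changed` are hypothetical, and the caller is assumed to list in `changed` every vertex whose transfer function or incoming edges were modified.

```python
from collections import deque


def affected_vertices(succs, changed):
    """Return all vertices forward-reachable from a changed vertex (inclusive)."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        v = queue.popleft()
        for s in succs.get(v, ()):
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return seen


def incremental_update(succs, gen, kill, cached_out, changed):
    """Recompute out-facts only inside the affected subgraph.

    Assumes union merge and gen/kill transfer functions; `cached_out`
    holds the out-facts of the previous whole-program run.
    """
    affected = affected_vertices(succs, changed)

    # Build predecessor lists so each vertex can merge its incoming facts.
    preds = {}
    for v, targets in succs.items():
        for t in targets:
            preds.setdefault(t, set()).add(v)

    out = {v: set(facts) for v, facts in cached_out.items()}
    for v in affected:
        out[v] = set()                      # discard stale facts

    worklist = deque(affected)
    while worklist:
        v = worklist.popleft()
        in_v = set()
        for p in preds.get(v, ()):          # unaffected preds supply cached facts
            in_v |= out.get(p, set())
        new_out = gen.get(v, set()) | (in_v - kill.get(v, set()))
        if new_out != out[v]:
            out[v] = new_out
            worklist.extend(succs.get(v, ()))  # successors are affected by construction
    return out, affected


# Previous run on 1 -> 2 -> 3 and 1 -> 4, where vertex 2 did not kill "a".
old_out = {1: {"a"}, 2: {"a", "b"}, 3: {"a", "b"}, 4: {"a"}}
# The edit makes vertex 2 kill "a"; only vertices 2 and 3 need recomputation.
new_out, touched = incremental_update(
    {1: [2, 4], 2: [3], 3: [], 4: []},
    {1: {"a"}, 2: {"b"}},
    {2: {"a"}},
    old_out,
    changed=[2],
)
print(sorted(touched), new_out[3])
```

Restarting the affected region from empty facts keeps the sketch correct when an edit removes facts, at the price of redoing work that a finer-grained scheme could avoid.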

Paper Structure

This paper contains 31 sections, 3 theorems, 4 equations, 9 figures, 4 tables, 7 algorithms.

Key Result

Theorem 1

Given an active vertex $k$, let $preds(k)$ be the set of $k$'s predecessors. Without loss of generality, suppose that at the previous superstep, a subset of $k$'s predecessors, i.e., $P'(k) \subseteq preds(k)$, update their outgoing dataflow facts, while the outgoing facts of the remaining, i.e., $P(k)$ ...

Figures (9)

  • Figure 1: One superstep computation at vertex 4 in Algorithm \ref{a:naive-algo}.
  • Figure 2: One superstep computation at vertex 4 in Algorithm \ref{a:opt-algo}.
  • Figure 3: Workflow of Distributed Incremental Dataflow Analysis.
  • Figure 4: Atomic changes on CFG.
  • Figure 5: Sub-CFG for incremental update in Algorithm \ref{a:reachability-naive-algo}.
  • ...and 4 more figures

Theorems & Definitions (3)

  • Theorem 1: Accumulative Property
  • Theorem 2: Incremental Property
  • Theorem 3: Consistency on $\mathcal{G}_{\textit{add\_only}}$