On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study

Amir M. Mir, Mehdi Keshani, Sebastian Proksch

TL;DR

The work tackles the challenge of imprecise static CGs by introducing NYXCorpus and evaluating ML-based pruning under conservative training and inference settings. By comparing 0-CFA and 1-CFA CGs and leveraging CodeBERT/CodeT5-based semantic features, the study demonstrates substantial precision gains with manageable recall loss, and shows that pruned CGs can match or approach the quality of context-sensitive analyses while being smaller and faster to analyze. The results emphasize the practical viability of pruning for security analyses, where paranoid configurations come close to baseline coverage with up to 3.5x faster analyses on 69% smaller CGs. The findings highlight the importance of addressing data imbalance, exploring hybrid static analysis approaches, and refining feature engineering to improve both recall and precision in real-world software engineering tasks.

Abstract

Static call graph (CG) construction often over-approximates call relations, leading to sound but imprecise results. Recent research has explored machine learning (ML)-based CG pruning as a means to enhance precision by eliminating false edges. However, current methods suffer from a limited evaluation dataset, imbalanced training data, and reduced recall, which affects practical downstream analyses. Prior results have also not yet been compared with advanced static CG construction techniques. This study tackles these issues. We introduce the NYXCorpus, a dataset of real-world Java programs with high test coverage; we collect traces from test executions and build a ground truth of dynamic CGs. We leverage these CGs to explore conservative pruning strategies during the training and inference of ML-based CG pruners. We conduct a comparative analysis of static CGs generated using zero control flow analysis (0-CFA) and those produced by a context-sensitive 1-CFA algorithm, evaluating both with and without pruning. We find that CG pruning is a difficult task for real-world Java projects: substantial improvements in CG precision (+25%) come at the cost of reduced recall (-9%). However, our experiments show promising results: even when we favor recall over precision by using an F2 metric, pruned CGs have comparable quality to a context-sensitive 1-CFA analysis while being computationally less demanding. The resulting CGs are much smaller (69%) and substantially faster to analyze (3.5x speed-up), with virtually unchanged results in our downstream analysis.
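To make the precision/recall trade-off and the F2 metric mentioned above concrete, here is a minimal Python sketch. It assumes the usual edge-set definitions (precision and recall of a static CG measured against the edges observed in a dynamic CG) and the standard F-beta formula; the helper names and the toy edges are hypothetical and are not taken from the paper or its tooling.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (caller, callee); hypothetical representation


def precision_recall(static_edges: Set[Edge], dynamic_edges: Set[Edge]) -> Tuple[float, float]:
    """Precision and recall of a static CG against a dynamic ground truth."""
    true_edges = static_edges & dynamic_edges
    precision = len(true_edges) / len(static_edges) if static_edges else 0.0
    recall = len(true_edges) / len(dynamic_edges) if dynamic_edges else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example (made-up edges, for illustration only).
static_cg = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
dynamic_cg = {("A", "B"), ("B", "D"), ("B", "E")}

p, r = precision_recall(static_cg, dynamic_cg)
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

In this toy case recall is higher than precision, so F2 comes out above F1. This illustrates why selecting models by F2 rewards pruners that retain true edges even if some false edges survive.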

Paper Structure (55 sections, 10 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Overview of our approach used in this study
  • Figure 2: Performance of the models with different weights to the positive class (retaining edges)