Boosting Pointer Analysis With LLM-Enhanced Allocation Function Detection

Baijun Cheng, Kailong Wang, Ling Shi, Haoyu Wang, Peng Di, Ding Li, Xiangqun Chen, Yao Guo

TL;DR

This work tackles imprecision in C/C++ pointer analysis caused by custom allocation functions (CAFs) by introducing CAFD, a lightweight, LLM-assisted pre-analysis that identifies side-effect-free CAFs (S-CAFs) and models them as distinct heap sources. By combining simple value-flow tracking with targeted LLM reasoning, CAFD achieves broad S-CAF coverage with minimal overhead, yielding context-sensitivity-like precision without the cost of a fully context-sensitive analysis. Empirical results across 17 real-world projects show substantial gains: about 38x more modeled heap objects and ~41% smaller alias sets at ~1.4x runtime overhead, along with improved indirect-call resolution and 29 newly discovered memory bugs. The approach demonstrates that precise CAF identification is a scalable and practical path to enhancing pointer analysis in large software systems, and the authors provide open-source toolchains for reproducibility.

Abstract

Pointer analysis is foundational for many static analysis tasks, yet its effectiveness is often hindered by imprecise modeling of heap allocations, particularly in C/C++ programs where custom allocation functions (CAFs) are pervasive. Existing approaches largely overlook these custom allocators, leading to coarse aliasing and low analysis precision. In this paper, we present CAFD, a novel and lightweight technique that enhances pointer analysis by automatically detecting side-effect-free custom allocation functions. CAFD employs a hybrid approach: it uses value-flow analysis to detect straightforward wrappers and leverages Large Language Models (LLMs) to reason about more complex allocation patterns with side effects, ensuring that only side-effect-free functions are modeled as allocators. This targeted enhancement enables precise modeling of heap objects at each call site, achieving context-sensitivity-like benefits without significant overhead. We evaluated CAFD on 17 real-world C projects, identifying over 700 CAFs. Integrating CAFD into a baseline pointer analysis yields a 38x increase in modeled heap objects and a 41.5% reduction in alias set sizes, with only 1.4x runtime overhead. Furthermore, the LLM-enhanced pointer analysis improves indirect call resolution and discovers 29 previously undetected memory bugs, including 6 from real-world industrial applications. These results demonstrate that precise modeling of CAFs can offer a scalable and practical path to improving pointer analysis in large software systems.

Paper Structure

This paper contains 30 sections, 3 equations, 8 figures, 19 tables, 1 algorithm.

Figures (8)

  • Figure 1: Examples of allocation wrappers from bash. CI denotes "context-insensitive" (base Andersen), CS denotes "context-sensitive".
  • Figure 2: C-CAFs from linux and nginx.
  • Figure 3: Overview of CAFD.
  • Figure 4: Prompts used for querying LLM.
  • Figure 5: An Illustrative Example of CAFD.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 1: S-CAFs: Side-effect-free CAFs
  • Definition 2: C-CAFs: Complex CAFs