Boosting Pointer Analysis With LLM-Enhanced Allocation Function Detection
Baijun Cheng, Kailong Wang, Ling Shi, Haoyu Wang, Peng Di, Ding Li, Xiangqun Chen, Yao Guo
TL;DR
This work tackles imprecision in C/C++ pointer analysis caused by custom allocation functions (CAFs) by introducing CAFD, a lightweight, LLM-assisted pre-analysis that identifies side-effect-free CAFs (S-CAFs) and models them as distinct heap sources. By combining simple value-flow tracking with targeted LLM reasoning, CAFD achieves broad S-CAF coverage with minimal overhead, yielding context-sensitivity-like precision without the cost of full context sensitivity. Empirical results across 17 real-world C projects show substantial gains: about 38x more modeled heap objects and 41.5% smaller alias sets, at the cost of about 1.4x runtime overhead, along with improved indirect-call resolution and 29 newly discovered memory bugs. The approach demonstrates that precise CAF identification is a scalable and practical path to enhancing pointer analysis in large software systems, and the authors provide open-source toolchains for reproducibility.
Abstract
Pointer analysis is foundational for many static analysis tasks, yet its effectiveness is often hindered by imprecise modeling of heap allocations, particularly in C/C++ programs where custom allocation functions (CAFs) are pervasive. Existing approaches largely overlook these custom allocators, leading to coarse aliasing and low analysis precision. In this paper, we present CAFD, a novel and lightweight technique that enhances pointer analysis by automatically detecting side-effect-free custom allocation functions. CAFD employs a hybrid approach: it uses value-flow analysis to detect straightforward wrappers and leverages Large Language Models (LLMs) to reason about more complex allocation patterns that may involve side effects, ensuring that only side-effect-free functions are modeled as allocators. This targeted enhancement enables precise modeling of heap objects at each call site, achieving context-sensitivity-like benefits without significant overhead. We evaluated CAFD on 17 real-world C projects, identifying over 700 CAFs. Integrating CAFD into a baseline pointer analysis yields a 38x increase in modeled heap objects and a 41.5% reduction in alias set sizes, with only 1.4x runtime overhead. Furthermore, the LLM-enhanced pointer analysis improves indirect call resolution and discovers 29 previously undetected memory bugs, including 6 from real-world industrial applications. These results demonstrate that precise modeling of CAFs offers a scalable and practical path to improving pointer analysis in large software systems.
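To make the distinction in the abstract concrete, the following is a minimal C sketch of the two kinds of allocator it contrasts. All names here (`my_alloc`, `tracked_alloc`, `last_alloc`, and the two helper functions) are hypothetical and are not taken from the paper or its tooling. Under a plain allocation-site abstraction, both pointers returned by `my_alloc` would typically map to the single `malloc` site inside the wrapper and so appear to alias; treating the wrapper itself as an allocator would give each call site its own abstract heap object. The second function shows the kind of side effect that could make such treatment unsafe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical simple wrapper: it only forwards to malloc and returns the
 * fresh pointer, so it has no side effect on other program state. */
static void *my_alloc(size_t size) {
    return malloc(size);
}

/* Hypothetical allocator WITH a side effect: the fresh pointer also escapes
 * through a global, so callers are not the only holders of the object. */
static void *last_alloc = NULL;

static void *tracked_alloc(size_t size) {
    void *p = malloc(size);
    last_alloc = p; /* side effect: pointer stored outside the return value */
    return p;
}

/* Two call sites of the simple wrapper yield two different runtime objects.
 * Returns 1 when both allocations succeed and are distinct. */
static int distinct_call_sites(void) {
    int *a = my_alloc(sizeof *a); /* call site 1 */
    int *b = my_alloc(sizeof *b); /* call site 2 */
    int distinct = 0;
    if (a != NULL && b != NULL) {
        *a = 1;
        *b = 2;
        distinct = (a != b) && (*a == 1) && (*b == 2);
    }
    free(a);
    free(b);
    return distinct;
}

/* The tracked allocator leaves a second reference in the global.
 * Returns 1 when the returned pointer and the global refer to the same object. */
static int tracked_pointer_escapes(void) {
    int *c = tracked_alloc(sizeof *c);
    int escaped = (c != NULL) && (last_alloc == (void *)c);
    free(c);
    last_alloc = NULL;
    return escaped;
}
```

Calling `distinct_call_sites()` exercises the case where per-call-site heap modeling matches runtime behavior, while `tracked_pointer_escapes()` shows why an allocator with side effects cannot simply be replaced by a fresh object at each call: the global would then be missing from the points-to facts.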
