CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building

Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, Min Yang

TL;DR

This work addresses the challenge of automatically building C/C++ open source software (OSS), a task hampered by diverse build systems and complex dependency management. It introduces CXXCrafter, an LLM-based agent with Parser, Generator, and Executor modules that iteratively parse repositories, generate Dockerfile-based build solutions, and execute builds within containers, guided by rich prompts and retrieval-augmented information. Empirical evaluation on the Top100 and Awesome-CPP datasets demonstrates a 78% overall success rate, with notable advantages over default build commands and bare LLMs, including three Top100 projects that CXXCrafter built successfully but manual efforts did not. The study analyzes common causes of build failure, presents a flexible, reusable architecture for LLM-driven build automation, and outlines practical implications for downstream program analysis and vulnerability reproduction.
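The iterative generate-and-execute loop described above can be illustrated with a minimal Python sketch. It is not CXXCrafter's actual implementation: the names `build_loop`, `BuildResult`, `fake_generate`, and `fake_execute`, the retry limit, and the Dockerfile contents are all hypothetical, and simple stub functions stand in for the LLM-backed Generator and the container-backed Executor so that the control flow can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class BuildResult:
    success: bool
    log: str  # build output fed back to the generator on failure


def build_loop(
    repo_info: dict,
    generate: Callable[[dict, Optional[str], Optional[str]], str],
    execute: Callable[[str], BuildResult],
    max_attempts: int = 5,  # assumed limit, not taken from the paper
) -> Tuple[Optional[str], int]:
    """Regenerate a Dockerfile until the build succeeds or attempts run out."""
    dockerfile: Optional[str] = None
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Generator step: previous Dockerfile and error log guide the revision.
        dockerfile = generate(repo_info, dockerfile, feedback)
        # Executor step: run the build and capture its outcome.
        result = execute(dockerfile)
        if result.success:
            return dockerfile, attempt
        feedback = result.log
    return None, max_attempts


def fake_generate(info: dict, previous: Optional[str], feedback: Optional[str]) -> str:
    """Stub generator: adds a zlib package once the error log mentions zlib."""
    lines = [
        "FROM ubuntu:22.04",
        "RUN apt-get update && apt-get install -y build-essential cmake",
    ]
    if feedback and "zlib" in feedback:
        lines.append("RUN apt-get install -y zlib1g-dev")
    lines += ["COPY . /src", "WORKDIR /src", "RUN cmake -S . -B build && cmake --build build"]
    return "\n".join(lines)


def fake_execute(dockerfile: str) -> BuildResult:
    """Stub executor: pretends the build needs zlib headers to succeed."""
    if "zlib1g-dev" in dockerfile:
        return BuildResult(True, "build finished")
    return BuildResult(False, "fatal error: zlib.h: No such file or directory")


if __name__ == "__main__":
    dockerfile, attempts = build_loop({"build_system": "cmake"}, fake_generate, fake_execute)
    print(f"succeeded after {attempts} attempt(s)")
```

Passing the generator and executor in as callables keeps the loop independent of any particular LLM client or container runtime, which is one plausible way to realize the modular design the summary describes.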

Abstract

Project building is pivotal to support various program analysis tasks, such as generating intermediate representation code for static analysis and preparing binary code for vulnerability reproduction. However, automating the building process for C/C++ projects is a highly complex endeavor, involving tremendous technical challenges, such as intricate dependency management, diverse build systems, varied toolchains, and multifaceted error handling mechanisms. Consequently, building C/C++ projects often proves to be difficult in practice, hindering the progress of downstream applications. Unfortunately, research on facilitating the building of C/C++ projects remains inadequate. The emergence of Large Language Models (LLMs) offers promising solutions to automated software building. Trained on extensive corpora, LLMs can help unify diverse build systems through their comprehension capabilities and address complex errors by leveraging tacit knowledge storage. Moreover, LLM-based agents can be systematically designed to dynamically interact with the environment, effectively managing dynamic building issues. Motivated by these opportunities, we first conduct an empirical study to systematically analyze the current challenges in the C/C++ project building process. In particular, we observe that most popular C/C++ projects encounter an average of five errors when relying solely on the default build systems. Based on our study, we develop an automated build system called CXXCrafter to specifically address the above-mentioned challenges, such as dependency resolution. Our evaluation on open-source software demonstrates that CXXCrafter achieves a success rate of 78% in project building. Specifically, in the Top100 dataset, 72 projects are built successfully by both CXXCrafter and manual efforts, 3 by CXXCrafter only, and 14 manually only. ...

Paper Structure

This paper contains 23 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The Statistics of Build Tools used in the Top 100 and Awesome-CPP Datasets (introduced in the Evaluation section).
  • Figure 2: The Overall Framework of CXXCrafter
  • Figure 3: The Parser Module. It is responsible for automatically extracting and analyzing information such as dependencies and build system details, which are crucial for subsequent build processes.
  • Figure 4: Prompt of the Generator Module. The prompts corresponding to numbers 3, 4, and 5 are from the Parser’s results, with blue text indicating dynamically parsed content.
  • Figure 5: Number of Successful Builds of CXXCrafter Variants on the Top100 Dataset.