The CodeInverter Suite: Control-Flow and Data-Mapping Augmented Binary Decompilation with LLMs

Peipei Liu, Jian Sun, Rongkang Sun, Li Chen, Zhaoteng Yan, Peizheng Zhang, Dapeng Sun, Dawei Wang, Xiaoling Zhang, Dan Li

TL;DR

The paper tackles the bottlenecks of end-to-end LLM-based binary decompilation by introducing the CodeInverter Suite, which combines control-flow reasoning with explicit data context through prompts augmented with control flow graphs (CFGs) and data mappings. It presents the CodeInverter Workflow (CIW), the CodeInverter Dataset (CID) with 8.69 million CFG- and data-mapped samples, and the CodeInverter Models (CIMs, at 1.3B and 6.7B parameters) trained on CID to enable efficient, privacy-preserving inference. Empirical results demonstrate substantial gains in re-executability, structural readability, and overall decompilation quality, with CIM-6.7B + CIW achieving state-of-the-art performance on key benchmarks and outperforming larger baselines. The work provides an open-source pathway for high-quality, locally deployable binary decompilation and highlights avenues for future enhancements in data-flow modeling and robustness to compiler optimizations.

Abstract

Binary decompilation plays a vital role in various cybersecurity and software engineering tasks. Recently, end-to-end decompilation methods powered by large language models (LLMs) have garnered significant attention due to their ability to generate highly readable source code with minimal human intervention. However, existing LLM-based approaches face several critical challenges, including limited capability in reconstructing code structure and logic, low accuracy in data recovery, concerns over data security and privacy, and high computational resource requirements. To address these issues, we develop the CodeInverter Suite, making three contributions: (1) The CodeInverter Workflow (CIW) is a novel prompt engineering workflow that incorporates control flow graphs (CFGs) and explicit data mappings to improve LLM-based decompilation. (2) Using CIW on well-known source code datasets, we curate the CodeInverter Dataset (CID), a domain-specific dataset of 8.69 million samples that contain CFGs and data mapping tables. (3) We train the CodeInverter Models (CIMs) on CID, producing two lightweight LLMs (with 1.3B and 6.7B parameters) intended for efficient inference in privacy-sensitive or resource-constrained environments. Extensive experiments on two benchmarks demonstrate that CIW substantially enhances the performance of various LLMs across multiple metrics. Our CIM-6.7B achieves state-of-the-art decompilation performance, outperforming existing LLMs with over 100x more parameters on decompilation tasks, with average improvements of 11.03% in re-executability and 6.27% in edit similarity.

Paper Structure

This paper contains 18 sections, 6 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Motivating example. Presented are the source code (a) of the case sample alongside the decompilations of Hex-Rays (b), CIM without CFG (c), CIM without data mapping (d) and CIM (e).
  • Figure 2: The decompilation workflow with LLMs in this paper (top: inference, bottom: training).
  • Figure 3: Prompt engineering details of our proposed CIW (the resulting prompt for all LLMs).
  • Figure 4: An example of readable decompilation. Presented are the source code (a) of the case sample alongside the decompilations of CIM-6.7B with CIW (b), DeepSeek-V3 with CIW (c), GPT-4o with CIW (d), and Hex-Rays (e).
  • Figure 5: Qualitative example of the decompilations made by CIM with CIW and its variations that ablate different components. Presented are the source code (a) of the case sample alongside the decompilations of CIM-6.7B + CIW (b), CIM-6.7B + CIW w/o CFG (c), and CIM-6.7B + CIW w/o DM (d).