LLM4Decompile: Decompiling Binary Code with Large Language Models

Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang

TL;DR

The paper tackles binary decompilation readability and executability by introducing LLM4Decompile, the first large open-source LLM series (1.3B–33B) trained specifically for decompilation. It proposes two complementary approaches: End2end-Decompile for direct binary-to-source translation and Refined-Decompile to enhance outputs from traditional tools like Ghidra, supported by data augmentation, data cleaning, and a two-stage training regime. Empirical results show substantial gains, with End models achieving over 45% re-executability on HumanEval and around 18% on ExeBench, and Refined-Decompile providing an additional 16.2% improvement over End, while maintaining reasonable quality under obfuscation. The work releases code, data, and models, highlighting the potential of LLMs to augment traditional reverse-engineering tools and outlining ethical considerations around obfuscation and misuse.

Abstract

Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
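
The abstract reports its results as re-executability rates and as improvements over baselines. The sketch below shows one way such numbers could be computed. It assumes that a sample counts as re-executable when the decompiled code recompiles and passes all of its test assertions. The `SampleOutcome` record, both function names, and the toy values are hypothetical illustrations, not the authors' evaluation code or results.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SampleOutcome:
    """Outcome of checking one decompiled function (hypothetical record type)."""
    compiled: bool      # the decompiled source recompiled without errors
    tests_passed: bool  # the recompiled code passed all of its test assertions


def re_executability_rate(outcomes: Iterable[SampleOutcome]) -> float:
    """Fraction of samples that both recompile and pass their tests."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    ok = sum(1 for o in outcomes if o.compiled and o.tests_passed)
    return ok / len(outcomes)


def relative_improvement(new_rate: float, base_rate: float) -> float:
    """Relative gain of new_rate over base_rate; 1.0 means +100%."""
    if base_rate <= 0:
        raise ValueError("base_rate must be positive")
    return (new_rate - base_rate) / base_rate


# Toy outcomes, invented for illustration only (not results from the paper).
toy = [
    SampleOutcome(compiled=True, tests_passed=True),
    SampleOutcome(compiled=True, tests_passed=False),
    SampleOutcome(compiled=False, tests_passed=False),
    SampleOutcome(compiled=True, tests_passed=True),
]
toy_rate = re_executability_rate(toy)  # 2 of the 4 toy samples re-execute
```

Under this reading, a model that "outperforms by over 100%" would have `relative_improvement(model_rate, baseline_rate)` greater than 1.0.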

Paper Structure (42 sections, 1 equation, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Illustration of compiling source code to binary, disassembling binary to assembly code (ASM), and decompiling ASM to pseudo-code with Ghidra. The pseudo-code is hard to read and not executable.
  • Figure 2: End2end-Decompile framework. The source code (SRC) is compiled to binary, disassembled to assembly instructions (ASM), and decompiled by LLM4Decompile to generate SRC'. Loss is computed between SRC and SRC' for training.
  • Figure 3: Refined-Decompile framework. It differs from End2end-Decompile (Figure 2) only in the LLM's input, which is the pseudo-code produced by Ghidra.
  • Figure 4: Decompilation results of different approaches. The GPT-4o output is plausible yet fails to recover the array dimension (it produces an incorrect 2D array, arr[outer][inner]). Ghidra's pseudo-code is notably less readable, as discussed in Figure 1. The GPT-refined Ghidra result (Ghidra+GPT-4o) marginally improves readability but fails to correctly render for loops and array indexing. In contrast, LLM4Decompile-End and LLM4Decompile-Ref produce accurate and easy-to-read outputs.
  • Figure 5: Decompilation results of GPT-4o on ExeBench test case.
  • ...and 3 more figures
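
The data path in Figure 2 (source compiled to binary, binary disassembled to ASM) can be sketched as follows. This is a minimal sketch that assumes GCC and GNU objdump as the toolchain. The helper names, the choice of optimization levels, and the address-stripping cleanup are illustrative assumptions, not the authors' exact preprocessing. The helpers only build command lines and clean text, so nothing is compiled when the block runs.

```python
from typing import List, Optional

# Assumed set of GCC optimization levels to compile each source file at.
OPT_LEVELS = ("O0", "O1", "O2", "O3")


def compile_cmd(src_path: str, obj_path: str, opt: str) -> List[str]:
    """Build a gcc command that compiles one C file to an object file."""
    if opt not in OPT_LEVELS:
        raise ValueError(f"unsupported optimization level: {opt}")
    return ["gcc", f"-{opt}", "-c", src_path, "-o", obj_path]


def disassemble_cmd(obj_path: str) -> List[str]:
    """Build an objdump command that disassembles the object file."""
    return ["objdump", "-d", obj_path]


def strip_objdump_line(line: str) -> Optional[str]:
    """Return the instruction text of one objdump line, or None.

    objdump -d separates address, raw bytes, and instruction with tabs,
    so lines without three tab-separated fields (headers, blank lines,
    byte-only continuation lines) are dropped.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3 or not parts[0].strip().endswith(":"):
        return None
    # Collapse the column padding between mnemonic and operands.
    return " ".join(" ".join(parts[2:]).split())


def clean_asm(objdump_text: str) -> str:
    """Keep only instruction text, one instruction per line."""
    kept = (strip_objdump_line(line) for line in objdump_text.splitlines())
    return "\n".join(insn for insn in kept if insn is not None)


# Hand-written sample in objdump's layout (not output taken from the paper).
sample = (
    "0000000000000000 <f>:\n"
    "   0:\t55                   \tpush   %rbp\n"
    "   1:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "   4:\tc3                   \tret\n"
)
asm_for_model = clean_asm(sample)
```

In a real pipeline the two command lists would be passed to `subprocess.run`, and the cleaned ASM would be paired with the original source to form one training example. For the Refined-Decompile setting in Figure 3, the model input would instead be Ghidra's pseudo-code.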