LLM4Decompile: Decompiling Binary Code with Large Language Models
Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang
TL;DR
The paper addresses the poor readability and executability of decompiled binary code by introducing LLM4Decompile, the first large open-source LLM series (1.3B to 33B parameters) trained specifically for decompilation. It proposes two complementary approaches: End2end-Decompile, which translates binary code directly to source, and Refined-Decompile, which refines the output of traditional tools such as Ghidra. Both are supported by data augmentation, data cleaning, and a two-stage training regime. The LLM4Decompile-End models achieve over 45% re-executability on HumanEval and around 18% on ExeBench, and Refined-Decompile yields a further 16.2% improvement over LLM4Decompile-End, while maintaining reasonable quality under obfuscation. The work releases code, data, and models, highlights the potential of LLMs to augment traditional reverse-engineering tools, and outlines ethical considerations around obfuscation and misuse.
Abstract
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary code directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
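
The abstract reports results as a re-executability rate and as percentage improvements over baselines. The sketch below is one plausible way to compute both, not the authors' evaluation harness. It assumes the rate is the fraction of decompiled samples that recompile and pass their tests, and that "improvement" means relative gain over a baseline. The function names and the pass/fail lists are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def re_executability_rate(passed: Sequence[bool]) -> float:
    """Fraction of samples whose decompiled code recompiled and passed its tests.

    Assumption: each entry is the pass/fail outcome for one benchmark sample.
    """
    if not passed:
        raise ValueError("need at least one sample")
    return sum(passed) / len(passed)


def relative_improvement(new_rate: float, baseline_rate: float) -> float:
    """Relative gain of new_rate over baseline_rate as a fraction (1.0 means 100%)."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate


# Hypothetical outcomes for illustration only, not results from the paper.
baseline = re_executability_rate([True, False, False, False, False])  # 1 of 5 pass
candidate = re_executability_rate([True, True, True, False, False])   # 3 of 5 pass
gain = relative_improvement(candidate, baseline)

print(f"baseline={baseline:.2f} candidate={candidate:.2f} relative gain={gain:.0%}")
```

Under this reading, a model that passes three times as many samples as its baseline shows a 200% relative gain, which is how a claim such as "over 100%" can be compatible with absolute rates well below 100%.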
