
ReF Decompile: Relabeling and Function Call Enhanced Decompile

Yunlong Feng, Bohan Li, Xiaoming Shi, Qingfu Zhu, Wanxiang Che

TL;DR

Decompilation from binary code to high-level languages is hampered by information loss in end-to-end LLM methods, especially regarding control flow and variable data. ReF Decompile addresses this with two strategies: Relabeling, which preserves jump semantics, and Function Call, which retrieves missing variable information from the read-only data (rodata) segment. Together they form an end-to-end decompilation framework. Evaluated on the Humaneval-based Decompile-Eval benchmark, the approach achieves state-of-the-art re-executability of $61.43\%$ and a readability score of $3.69$, outperforming refine-based and other end-to-end baselines. Ablation and dataset analyses corroborate that the two strategies jointly improve accuracy and generality, suggesting practical value for reverse engineering, vulnerability analysis, and legacy software migration.

Abstract

The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages, enabling analysis in scenarios where source code is unavailable. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration. End-to-end decompilation methods based on large language models (LLMs) reduce reliance on additional tools and minimize manual intervention. However, previous end-to-end methods often lose information that is critical for reconstructing control flow structures and variables when processing binary files, making it difficult to recover the program's logic accurately. To address these issues, we propose the ReF Decompile method, which incorporates the following innovations: (1) The Relabeling strategy replaces jump target addresses with labels, preserving control flow clarity. (2) The Function Call strategy infers variable types and retrieves missing variable information from binary files. Experimental results on the Humaneval-Decompile Benchmark demonstrate that ReF Decompile surpasses comparable baselines and achieves a state-of-the-art (SOTA) re-executability rate of $61.43\%$.
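
To make the Relabeling idea concrete, here is a minimal sketch of replacing jump target addresses with labels. It is an illustration under stated assumptions, not the authors' implementation: the input format (pairs of hex address and instruction text), the `L0`, `L1`, ... label naming, and the function name `relabel` are all hypothetical, and the toy listing only borrows the `jbe 1109` instruction mentioned in the Figure 2 caption.

```python
import re

# Assumption: jumps look like "<mnemonic> <hex address>", e.g. "jbe 1109".
# Real disassembler output carries extra annotations that this sketch ignores.
JUMP_RE = re.compile(r"^(j[a-z]+)\s+([0-9a-f]+)$")


def relabel(listing):
    """Replace jump target addresses with symbolic labels.

    listing: list of (address, instruction) pairs, addresses as hex strings.
    Returns a list of text lines with labels inserted before their targets.
    """
    addresses = {addr for addr, _ in listing}

    # Pass 1: collect every jump target that lies inside this listing.
    targets = set()
    for _, ins in listing:
        m = JUMP_RE.match(ins.strip())
        if m and m.group(2) in addresses:
            targets.add(m.group(2))

    # Number labels in address order so they read top to bottom.
    labels = {}
    for addr, _ in listing:
        if addr in targets:
            labels[addr] = f"L{len(labels)}"

    # Pass 2: emit label lines and rewrite jump operands.
    out = []
    for addr, ins in listing:
        if addr in labels:
            out.append(f"{labels[addr]}:")
        m = JUMP_RE.match(ins.strip())
        if m and m.group(2) in labels:
            ins = f"{m.group(1)} {labels[m.group(2)]}"
        out.append(f"    {ins}")
    return out


# Toy listing (hypothetical addresses and instructions).
example = [
    ("10f9", "comisd xmm0,xmm1"),
    ("10fd", "jbe 1109"),
    ("10ff", "mov eax,0x1"),
    ("1104", "jmp 110e"),
    ("1109", "mov eax,0x0"),
    ("110e", "ret"),
]
relabeled = relabel(example)
print("\n".join(relabeled))
```

The point of the sketch is that the raw addresses can be dropped from the model input without losing where each jump lands, because the label line marks the destination explicitly.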

Paper Structure

This paper contains 24 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: LLM-based decompilation methods can primarily be categorized into two types: (1) refine-based methods, which aim to refine the pseudo code generated by decompilers such as Ghidra to recover the original code; and (2) end-to-end methods, which aim to reconstruct the original code directly from assembly code.
  • Figure 2: Comparison of the previous method and our method (ReF Decompile). Previous end-to-end methods rely solely on information from the executable segment, leading to "information loss" during decompilation. For example, the processed assembly here loses variable information (“3.14” in the source code) and the jump target (“1109” of “jbe 1109” in the raw assembly). This results in code reconstructions that appear plausible but are actually incorrect. By incorporating jump-target labels (Relabeling) and leveraging relevant tools (Function Call), the model gains a deeper understanding of the code's jump logic and can access valuable information stored outside the executable segment. This enhancement allows the model to reconstruct the original code accurately, significantly improving the precision and reliability of the decompilation process.
  • Figure 3: The processing details of Relabeling.
  • Figure 4: Overview of the Data Construction for Function Call.
  • Figure 5: Re-executability rate comparison between ReF Decompile and LLM4Decompile-Ref models of varying sizes (1.3B, 6.7B, and 22B parameters).
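
The Function Call idea from the Figure 2 caption, where a value such as 3.14 lives outside the executable segment, can be sketched as a small tool that the model might invoke with an address and an inferred type. This is a hedged illustration only: the tool name `read_rodata`, its parameters, the type table, and the toy segment layout are assumptions and are not taken from the paper.

```python
import struct

# Hypothetical mapping from C type names to little-endian struct formats.
FORMATS = {"int": "<i", "float": "<f", "double": "<d", "long": "<q"}


def read_rodata(segment, base, address, ctype, count=1):
    """Decode `count` values of type `ctype` stored at `address`.

    segment: raw bytes of a read-only data segment (assumed already extracted).
    base:    virtual address at which the segment is loaded.
    """
    fmt = FORMATS[ctype]
    size = struct.calcsize(fmt)
    offset = address - base
    if offset < 0 or offset + size * count > len(segment):
        raise ValueError("address range falls outside the segment")
    return [
        struct.unpack_from(fmt, segment, offset + i * size)[0]
        for i in range(count)
    ]


# Toy segment: 8 padding bytes, then the double 3.14, then the int 42.
BASE = 0x2000
SEGMENT = b"\x00" * 8 + struct.pack("<d", 3.14) + struct.pack("<i", 42)

# The model would supply the address seen in the assembly and its guessed type.
constant = read_rodata(SEGMENT, BASE, 0x2008, "double")
print(constant)
```

In this sketch the type argument matters as much as the address: the same eight bytes decode to a different value if read as `long` instead of `double`, which is why the strategy pairs type inference with retrieval.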