Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement

Yunlong Feng, Dechuan Teng, Yang Xu, Honglin Mu, Xiao Xu, Libo Qin, Qingfu Zhu, Wanxiang Che

TL;DR

The paper tackles decompilation, recovering high-level source code from compiled assembly, with two techniques. Self-Constructed Context Decompilation (sc$^2$dec) is a tuning-free method that recompiles the model's own decompiled output to build assembly-source pairs for in-context learning. Fine-grained Alignment Enhancement (FAE) is a fine-tuning strategy that combines step-by-step and end-to-end objectives and uses debugging information to align assembly with source at the statement level. Together, these methods raise re-executability on the Decompile-Eval benchmark to a new state-of-the-art of $52.41\%$, a gain of about $3.90\%$. The authors describe their data pipeline, implement a LoRA-based fine-tuning setup on the llm4decompile-6.7b model, and show that context construction and fine-grained alignment are orthogonal and complementary. The work targets software analysis in settings where source code is unavailable.

Abstract

Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc$^2$dec) method recompiles the LLM's decompilation results to construct pairs for in-context learning, helping the model improve decompilation performance. (2) Fine-grained Alignment Enhancement (FAE), which meticulously aligns assembly code with source code at the statement level by leveraging debugging information, is employed during the fine-tuning phase to achieve further improvements in decompilation. By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 3.90% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 52.41%. The code, data, and models are available at https://github.com/AlongWY/sccdec.
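The sc$^2$dec loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `decompile` and `compile_to_asm` are hypothetical callables standing in for the LLM and the compiler, and the prompt layout in `build_prompt` is an assumption made for the example.

```python
from typing import Callable, Optional, Tuple


def build_prompt(asm: str, demo: Optional[Tuple[str, str]] = None) -> str:
    """Build a decompilation prompt, optionally preceded by one demonstration.

    The "# Assembly:" / "# Source:" layout is an assumption for this sketch.
    """
    parts = []
    if demo is not None:
        demo_asm, demo_src = demo
        parts.append(f"# Assembly:\n{demo_asm}\n# Source:\n{demo_src}\n")
    parts.append(f"# Assembly:\n{asm}\n# Source:\n")
    return "\n".join(parts)


def sc2dec(
    asm: str,
    decompile: Callable[[str], str],                  # hypothetical: prompt -> source code
    compile_to_asm: Callable[[str], Optional[str]],   # hypothetical: source -> asm, None on failure
) -> str:
    """Decompile once, recompile the result, then decompile again with that pair as context."""
    first_guess = decompile(build_prompt(asm))

    # Only compilable output can serve as a demonstration.
    recompiled_asm = compile_to_asm(first_guess)
    if recompiled_asm is None:
        return first_guess

    # The (recompiled assembly, first guess) pair is a correct assembly-source
    # example by construction, so it is used as the in-context demonstration.
    return decompile(build_prompt(asm, demo=(recompiled_asm, first_guess)))
```

The demonstration pair is valid by construction, since the assembly was produced by compiling exactly that source, so the second pass can compare it against the target assembly even when the first guess was wrong.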

Paper Structure

This paper contains 29 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Pipeline of the decompilation. The input for decompilation tasks is typically assembly, with the source code invisible. Additionally, the source code obtained through decompilation usually does not exactly match the original code.
  • Figure 2: The pipeline for Self-Constructed Context Decompilation operates as follows: when the LLM decompiles and generates compilable code, we compile this code to construct the context, and then use it to decompile the initial assembly code again.
  • Figure 3: An example of step-by-step decompilation is presented, where the training objective requires the model to generate C code progressively after each assembly block. Fine-tuning the model with this objective aids in learning the fine-grained correspondences between assembly and C code.
  • Figure 4: The example for 1-shot learning. In this example, we have tried to cover common control logic such as if statements, loops, and early returns as much as possible. The example will be compiled with the same optimization level as the target assembly code.
  • Figure 5: A case study for sc$^2$dec, which is based on the llm4decompile-6.7b with FAE. By applying the sc$^2$dec method, the model detects and fixes the mismatch between the code and assembly.
  • ...and 2 more figures
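The step-by-step objective in Figure 3 relies on knowing which assembly instructions belong to which source statement. As a rough sketch of that alignment step, the function below groups consecutive instructions by source line number. The input format is an assumption: each instruction is taken to be already tagged with a line number, as a debug line table would allow, and parsing real debugging information is left out.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def align_blocks(
    asm_with_lines: List[Tuple[int, str]],  # hypothetical input: (source line number, instruction)
    source_lines: Dict[int, str],           # hypothetical input: line number -> source statement
) -> List[Tuple[str, str]]:
    """Pair each run of consecutive instructions with the statement it was compiled from."""
    pairs = []
    # groupby only merges adjacent items, so instruction order is preserved.
    for line_no, group in groupby(asm_with_lines, key=lambda item: item[0]):
        block = "\n".join(instr for _, instr in group)
        pairs.append((block, source_lines.get(line_no, "")))
    return pairs


asm = [
    (1, "push rbp"),
    (1, "mov rbp, rsp"),
    (2, "mov eax, 0"),
    (3, "pop rbp"),
    (3, "ret"),
]
src = {1: "int f() {", 2: "return 0;", 3: "}"}

for block, statement in align_blocks(asm, src):
    print(block)
    print("  ->", statement)
```

A training target in the spirit of Figure 3 could then be assembled by emitting each assembly block followed by its statement. The exact serialization used in the paper is not given in this section.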