SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly

Jordi Armengol-Estapé, Jackson Woodruff, Chris Cummins, Michael F. P. O'Boyle

TL;DR

SLaDe is a portable neural decompiler that combines a 200M-parameter Transformer, trained on real function-level assembly-to-C data, with a type inference engine that recovers external types. It uses a novel 8k UnigramLM-based tokenizer and dropout-free training, enabling accurate decompilation across ISAs (x86 and ARM) and optimization levels (-O0 and -O3). In large-scale evaluations on AnghaBench and ExeBench, SLaDe surpasses an industrial decompiler (Ghidra) and a general-purpose LLM (ChatGPT) in both correctness (IO accuracy) and readability (edit similarity): it is up to about 6× more accurate than Ghidra and up to about 4× more accurate than ChatGPT. The results show that small, targeted neural models are viable for practical, cross-ISA decompilation, and that pairing neural translation with program analysis via type inference improves the decompiled code.

Abstract

Decompilation is a well-studied area with numerous high-quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. However, to date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code. We develop a novel tokenizer and exploit no-dropout training to produce high-quality code. We utilize type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types, and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 functions from ExeBench on two ISAs and at two optimization levels. SLaDe is up to 6 times more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler, and up to 4 times more accurate than the large language model ChatGPT, and it generates significantly more readable code than both.

Paper Structure

This paper contains 47 sections, 11 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Comparing decompilation techniques to the ground truth. We compile the original code (box 2) using GCC -O3 and then decompile with each technique. As BTC was trained on -O0 code, we use -O0 to evaluate it. Ghidra (box 1) and ChatGPT (box 3) produce code that is very difficult to read, and in ChatGPT's case the code is also incorrect: it adds two arrays together rather than adding a constant to an array. BTC (box 5) produces significantly more readable, but also incorrect, code. SLaDe (box 6) produces readable, correct code.
  • Figure 2: We train a small Transformer to minimize the cross-entropy loss function. At inference time, we use the model to generate code. Generated code with missing typedefs is passed to PsycheC [Melo2018] to generate candidate types. We then check the inputs for correctness using input/output examples.
  • Figure 3: Algorithm for computing the edit-distance between two sequences. We use $\varepsilon$ to represent the empty sequence. We use edit similarity, which is $1 - \text{Edit Distance} / \text{Sequence Length}$, so that a higher edit similarity represents better readability.
  • Figure 4: ExeBench, x86: -O0 (left), -O3 (right), input-output (IO) accuracy and edit similarity. A decompiled program is IO accurate if it gives the same outputs for the same range of inputs as the original assembly. BTC's edit distance is as reported in [Hosseini2022] on a different dataset. That dataset is restricted to -O0 and does not support evaluations of correctness, so BTC's IO accuracy is omitted. SLaDe outperforms existing techniques, producing 1.17x to 3.83x more correct code and a higher edit similarity than Ghidra, ChatGPT and BTC.
  • Figure 5: ExeBench, x86: -O0 (left), -O3 (right), IO accuracy and edit similarity. SLaDe significantly outperforms existing techniques, producing 2.2x to 6.32x more accurate code and a higher edit similarity.
  • ...and 6 more figures
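
The edit similarity metric described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a standard Levenshtein edit distance (insertions, deletions and substitutions each cost 1), and it assumes that "Sequence Length" means the length of the longer of the two sequences; the caption does not specify either choice. The function names `edit_distance` and `edit_similarity` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences.

    Assumption: insert, delete and substitute each cost 1. The empty
    sequence (epsilon in the caption) is handled by the first row, where
    the distance from empty to b[:j] is j insertions.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence: i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def edit_similarity(candidate, reference):
    """1 - edit_distance / length, so higher means closer to the reference.

    Assumption: the normalizing length is the longer of the two sequences,
    which keeps the result in the range [0, 1].
    """
    length = max(len(candidate), len(reference))
    if length == 0:
        return 1.0  # two empty sequences are identical
    return 1.0 - edit_distance(candidate, reference) / length


if __name__ == "__main__":
    decompiled = "for (i = 0; i < n; i++) a[i] += 1;"
    original = "for (i = 0; i < n; ++i) a[i] += 1;"
    print(edit_similarity(decompiled, original))
```

Both functions accept any sequences, so they work on characters (as in the usage example) or on lists of tokens; which granularity the paper uses is not stated in the caption.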