Table of Contents
Fetching ...

Superset Decompilation

Chang Liu, Yihao Sun, Thomas Gilray, Kristopher Micinski

Abstract

Reverse engineering tools remain monolithic and imperative compared to the advancement of modern compiler architectures: analyses are tied to a single mutable representation, making them difficult to extend or refine, and forcing premature choices between soundness and precision. We observe that decompilation is the reverse of compilation and can be structured as a sequence of modular passes, each performing a granular and clearly defined interpretation of the binary at a progressively higher level of abstraction. We formalize this as provenance-guided superset decompilation (PGSD), a framework that monotonically derives facts about the binary into a relation store. Instead of committing early to a single interpretation, the pipeline retains ambiguous interpretations as parallel candidates with provenance, deferring resolution until the final selection phase. Manifold implements PGSD as a declarative reverse engineering framework that lifts Linux ELF binaries to C99 through a granular intermediate representation in ~35K lines of Rust and Datalog. On GNU coreutils, Manifold's output quality matches Ghidra, IDA Pro, angr, and RetDec on multiple metrics while producing fewer compiler errors, and generalizes across compilers and optimization levels.

Superset Decompilation

Abstract

Reverse engineering tools remain monolithic and imperative compared to the advancement of modern compiler architectures: analyses are tied to a single mutable representation, making them difficult to extend or refine, and forcing premature choices between soundness and precision. We observe that decompilation is the reverse of compilation and can be structured as a sequence of modular passes, each performing a granular and clearly defined interpretation of the binary at a progressively higher level of abstraction. We formalize this as provenance-guided superset decompilation (PGSD), a framework that monotonically derives facts about the binary into a relation store. Instead of committing early to a single interpretation, the pipeline retains ambiguous interpretations as parallel candidates with provenance, deferring resolution until the final selection phase. Manifold implements PGSD as a declarative reverse engineering framework that lifts Linux ELF binaries to C99 through a granular intermediate representation in ~35K lines of Rust and Datalog. On GNU coreutils, Manifold's output quality matches Ghidra, IDA Pro, angr, and RetDec on multiple metrics while producing fewer compiler errors, and generalizes across compilers and optimization levels.

Paper Structure

This paper contains 15 sections, 4 theorems, 10 equations, 9 figures, 4 tables.

Key Result

proposition 1

For every commutative semiring $K$ and valuation $v:X\to K$, there is a unique semiring homomorphism $h_v:\mathbb{N}[X]\to K$ extending $v$. Moreover, for any positive Datalog program $P$ and any $\mathbb{N}[X]$-annotated input database $I$, where ${P}^{K}$ denotes the semantics of $P$ evaluated over $K$.

Figures (9)

  • Figure 1: End-to-end overview of Manifold.
  • Figure 2: Function and struct-related statistics from decompilers. IDA and Ghidra do not recover struct by default. IDA types are sanitized to match C types.
  • Figure 3: Per-binary CodeBLEU score distributions across decompilers.
  • Figure 4: Decompiler metrics on different compilers/optimization levels
  • Figure 5: Running time for decompilers. Manifold parallelizes via Rayon across all available cores; angr parallelizes per function across all available cores.
  • ...and 4 more figures

Theorems & Definitions (16)

  • definition 1: Commutative Semiring
  • definition 2: $K$-Relation
  • definition 3: Provenance Polynomial Semiring
  • proposition 1: Universality
  • definition 4: IR Hierarchy
  • definition 5: Nodes
  • definition 6: Annotated Relation Store
  • definition 7: Statement Candidates
  • definition 8: Pass
  • definition 9: Immediate Consequence Operator
  • ...and 6 more