Forklift: An Extensible Neural Lifter

Jordi Armengol-Estapé, Rodrigo C. O. Rocha, Jackson Woodruff, Pasquale Minervini, Michael F. P. O'Boyle

TL;DR

Forklift tackles the problem of porting binary code across diverse ISAs by learning to lift assembly directly to LLVM IR, an IR that can be compiled to many target architectures. The authors propose an extensible, incremental learning framework that uses a fixed LLVM IR decoder and per-ISA encoders, enabling new ISAs to be added with minimal retraining. They train on a million-scale parallel dataset of LLVM IR and assembly across x86, ARM, and RISC‑V, with an IO-based accuracy harness to evaluate translations. Empirically, Forklift outperforms a state-of-the-art hand-written lifter and GPT-4 on two benchmarks, and demonstrates superior scalability and adaptability to new compilers and ISAs, highlighting practical viability for cross-ISA software porting and optimization workflows.

Abstract

The escalating demand to migrate legacy software across different Instruction Set Architectures (ISAs) has driven the development of assembly-to-assembly translators to map between their respective assembly languages. However, the development of these tools requires substantial engineering effort. State-of-the-art approaches use lifting, a technique where source assembly code is translated to an architecture-independent intermediate representation (IR), for example LLVM IR, and a pre-existing compiler then recompiles the IR to the target ISA. However, the hand-written rules these lifters employ are sensitive to the particular compiler and optimization level used to generate the code, and they require significant engineering effort to support each new ISA. We propose Forklift, the first neural lifter that learns how to translate assembly to LLVM IR using a token-level encoder-decoder Transformer. We show how to incrementally add support for new ISAs by fine-tuning the assembly encoder and freezing the IR decoder, improving the overall accuracy and efficiency. We collect millions of parallel LLVM IR, x86, ARM, and RISC-V programs across compilers and optimization levels to train Forklift and set up an input/output-based accuracy harness. We evaluate Forklift on two challenging benchmark suites and translate 2.5x more x86 programs than a state-of-the-art hand-written lifter and 4.4x more x86 programs than GPT-4, as well as enabling translation from new ISAs.
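
The input/output-based accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a lifted program counts as correct only if it matches the reference on every test input, and that a crash counts as a failure; the paper's actual harness may differ. Plain Python callables stand in for compiled programs, and all names here (`passes_io_tests`, `io_accuracy`, `cases`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Program = Callable[..., object]
Case = Tuple[Program, Program, Sequence[tuple]]


def passes_io_tests(reference: Program, candidate: Program,
                    inputs: Sequence[tuple]) -> bool:
    """Return True if candidate matches reference on every test input."""
    for args in inputs:
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception:
            # Assumption: a crashing candidate is treated as a failed lift.
            return False
        if actual != expected:
            return False
    return True


def io_accuracy(cases: List[Case]) -> float:
    """Fraction of (reference, candidate, inputs) cases that pass all tests."""
    if not cases:
        return 0.0
    passed = sum(passes_io_tests(ref, cand, ins) for ref, cand, ins in cases)
    return passed / len(cases)


# Toy stand-ins for (original program, lifted-and-recompiled program, inputs).
cases: List[Case] = [
    (lambda a, b: a + b, lambda a, b: b + a, [(1, 2), (0, 0), (-5, 7)]),
    (lambda x: x * 2, lambda x: x + 2, [(2,), (3,)]),  # agrees only on x == 2
    (lambda x: x // 1, lambda x: 1 // x, [(0,)]),       # candidate crashes
]

accuracy = io_accuracy(cases)
```

The second case shows why several inputs per program matter: the candidate agrees with the reference on one input but not the other, so it is counted as incorrect.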

Paper Structure (42 sections, 3 equations, 3 figures, 18 tables)

This paper contains 42 sections, 3 equations, 3 figures, 18 tables.

Figures (3)

  • Figure 1: Forklift lifts source x86, ARM, and RISC-V code into the intermediate representation (IR) of the LLVM compiler. We leverage LLVM's ability to compile to a wide range of target ISAs. As the target of Forklift is always LLVM IR, we freeze the decoder and fine-tune existing encoders to incrementally add support for new sources, maintaining accuracy on prior sources.
  • Figure 2: Input/output accuracy on different stages in the training, on ExeBench for ARM.
  • Figure 3: Input/output accuracy on different stages in the training, on Synth for ARM.