Exploring the Efficacy of Large Language Models (GPT-4) in Binary Reverse Engineering

Saman Pordanesh; Benjamin Tan

TL;DR

The paper assesses GPT-4's viability for Binary Reverse Engineering by testing its ability to interpret and explain both human-written and decompiled code across two phases: basic code interpretation and malware analysis. It employs two datasets (70 simple C problems and 15 malware C sources) and uses Ghidra/RetDec for decompilation, with both automated BLEU-based metrics and manual rubrics for evaluation. Findings show limited reliability of automated metrics and variable performance across tasks, with GPT-4 better at broad functionality and function-name generation but weaker in detailed security analysis and nuanced code relationships. The work highlights the need for specialized reverse-engineering datasets and evaluation benchmarks, and outlines future directions including expert-annotated benchmarks and open-source model comparisons to advance AI-assisted reverse engineering.
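To make the automated part of the evaluation concrete, the following is a minimal sketch of a BLEU-style score between a reference explanation and a model-generated one. This summary does not specify the paper's exact BLEU configuration, so the whitespace tokenization, the 4-gram limit, the add-one smoothing, and the function names `ngrams` and `sentence_bleu` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Simplified single-reference BLEU (illustrative, not the paper's setup).

    Assumptions: lowercase whitespace tokenization, uniform weights over
    1..max_n grams, and add-one smoothing on every n-gram precision.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps short texts from scoring exactly zero.
        log_precision_sum += math.log((overlap + 1) / (total + 1))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the smoothed precisions, scaled by the penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    reference = "reads two integers and prints their sum"
    close = "reads two integers and prints the sum"
    far = "sorts an array of strings"
    print(sentence_bleu(reference, close))
    print(sentence_bleu(reference, far))
```

Because a score of this kind rewards only surface n-gram overlap, two explanations that describe the same behaviour in different words can score poorly, which is consistent with the limited reliability of automated metrics reported above.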

Abstract

This study investigates the capabilities of Large Language Models (LLMs), specifically GPT-4, in the context of Binary Reverse Engineering (RE). Using a structured experimental approach, we analyzed the LLM's performance in interpreting and explaining human-written and decompiled code. The research comprised two phases: the first on basic code interpretation and the second on more complex malware analysis. Key findings indicate that LLMs are proficient at general code understanding but vary in effectiveness on detailed technical and security analyses. The study underscores the potential and current limitations of LLMs in reverse engineering, revealing crucial insights for future applications and improvements. We also examined our experimental methodology, including the evaluation methods and data constraints, which yielded technical guidance for future research in this field.

Paper Structure (34 sections, 5 figures, 6 tables)

Introduction
Overview of the Literature
Our Research Direction
Setups & Tools
Model Selection - GPT4
Tools
Python Scripting and Data Storage
Decompiling Tools - Ghidra and RetDec
Data
Dataset 1 - Simple C Programming Problems
Dataset 2 - Malware Source Codes in C
Experiment Design
Phase 1: Basic Code Interpretation
Scenario 1: Original Code Explanation
Scenario 2: Stripped Code Explanation
...and 19 more sections

Figures (5)

Figure 1: Workflow Diagram of Phase 1 - Basic Code Interpretation and Analysis in Large Language Model Experimentation
Figure 2: Workflow Diagram of Phase 2 - Advanced Analysis of Malware-Reversed Engineered Applications Using Large Language Models
Figure 3: BLEU score results for Scenario 1.
Figure 4: BLEU score results for Scenario 2.
Figure 5: BLEU score results for Scenario 3.
