Exploring the Efficacy of Large Language Models (GPT-4) in Binary Reverse Engineering
Saman Pordanesh, Benjamin Tan
TL;DR
The paper assesses GPT-4's viability for Binary Reverse Engineering by testing its ability to interpret and explain both human-written and decompiled code across two phases: basic code interpretation and malware analysis. It employs two datasets (70 simple C problems and 15 malware C sources) and uses Ghidra/RetDec for decompilation, with both automated BLEU-based metrics and manual rubrics for evaluation. Findings show limited reliability of automated metrics and variable performance across tasks, with GPT-4 better at broad functionality and function-name generation but weaker in detailed security analysis and nuanced code relationships. The work highlights the need for specialized reverse-engineering datasets and evaluation benchmarks, and outlines future directions including expert-annotated benchmarks and open-source model comparisons to advance AI-assisted reverse engineering.
Abstract
This study investigates the capabilities of Large Language Models (LLMs), specifically GPT-4, in the context of Binary Reverse Engineering (RE). Using a structured experimental approach, we analyzed the LLM's performance in interpreting and explaining human-written and decompiled code. The research comprised two phases: the first on basic code interpretation and the second on more complex malware analysis. Key findings indicate that LLMs are proficient at general code understanding but vary in effectiveness on detailed technical and security analyses. The study underscores both the potential and the current limitations of LLMs in reverse engineering, and offers insights for future applications and improvements. We also examined our experimental methodology, including the evaluation methods and data constraints, which provided a technical direction for future research in this field.
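To illustrate why automated n-gram metrics such as BLEU can be an unreliable measure of explanation quality, the following is a minimal sketch of a simplified BLEU-style score. It is not the evaluation code used in the study: the function names `ngrams` and `bleu_like`, the whitespace tokenization, the choice of unigrams and bigrams only, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_like(candidate, reference, max_n=2):
    """Simplified BLEU-style score (hypothetical helper, not the paper's metric):
    clipped n-gram precision for n = 1..max_n, combined by geometric mean,
    times a brevity penalty. Single reference, no smoothing."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Made-up example: two descriptions of the same behavior.
reference = "the function reverses a string in place"
paraphrase = "this routine flips the characters of a buffer without copying"
verbatim = "the function reverses a string in place"

score_paraphrase = bleu_like(paraphrase, reference)
score_verbatim = bleu_like(verbatim, reference)
```

In this sketch the paraphrase shares no bigram with the reference, so it scores zero even though it describes the same behavior, while a verbatim copy scores one. Surface overlap of this kind is one plausible reason automated scores can diverge from manual rubric-based judgments.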
